We propose a computational framework named iterative local
adaptive majorize-minimization (I-LAMM) to simultaneously control
algorithmic complexity and statistical error when fitting high dimensional
models. I-LAMM is a two-stage algorithmic implementation of
the local linear approximation to a family of folded concave penalized
quasi-likelihood. The first stage solves a convex program with a
crude precision tolerance to obtain a coarse initial estimator, which
is further refined in the second stage by iteratively solving a sequence
of convex programs with smaller precision tolerances. Theoretically,
we establish a phase transition: the first stage has a sublinear iteration
complexity, while the second stage achieves an improved linear
rate of convergence. Though this framework is completely algorithmic,
it provides solutions with optimal statistical performances and
controlled algorithmic complexity for a large family of nonconvex
optimization problems. The iteration effects on statistical errors are
clearly demonstrated via a contraction property. Our theory relies
on a localized version of the sparse/restricted eigenvalue condition,
which allows us to analyze a large family of loss and penalty functions
and provide optimality guarantees under very weak assumptions
(for example, I-LAMM requires much weaker minimal signal strength
than other procedures). Thorough numerical results are provided to
support the obtained theory.


1. Introduction. Modern data acquisitions routinely measure a massive
number of variables, which can be much larger than the sample size, making
statistical inference an ill-posed problem. For inferential tractability and
interpretability, one common approach is to exploit the penalized M-estimator

(1.1) $\hat\beta = \operatorname{argmin}_{\beta \in \mathbb{R}^d} \{ \mathcal{L}(\beta) + \mathcal{R}_\lambda(\beta) \}$,

where $\mathcal{L}(\cdot)$ is a smooth loss function and $\mathcal{R}_\lambda(\cdot)$ is a sparsity-inducing penalty
with a regularization parameter $\lambda$. Our framework encompasses the square
loss, logistic loss, Gaussian graphical model negative log-likelihood loss, Huber
loss, and the family of folded concave penalties [9]. Finding optimal statistical
procedures with controlled computational complexity characterizes
the efforts of high-dimensional statistical learning in the last two decades.
This paper makes an important leap toward this grand challenge by proposing
a general algorithmic strategy for solving (1.1) even when $\mathcal{R}_\lambda(\beta)$ is
nonconvex.

A popular choice of $\mathcal{R}_\lambda(\beta)$ is the Lasso penalty [25], a convex penalty.
Though a large literature exists on understanding the theory of penalized
M-estimators with convex penalties [8, 4, 26, 22], it has been well known
[9, 33] that the convex penalties introduce non-negligible estimation biases.
In addition, the algorithmic issues for finding the global minimizer are
rarely addressed. To eliminate the estimation bias, a family of folded-concave
penalties was introduced by [9], which includes the smooth clipped absolute
deviation (SCAD) [9], minimax concave penalty (MCP) [29], and capped
$\ell_1$-penalty [32]. Compared to their convex counterparts, these nonconvex
penalties eliminate the estimation bias and attain more refined statistical
rates of convergence. However, it is more challenging to analyze the theoretical
properties of the resulting estimators due to the nonconvexity of the penalty
functions. Existing work on nonconvex penalized M-estimators treats the
statistical properties and practical algorithms separately. On one hand,
statistical properties are established for the hypothetical global optimum (or
some local minimum), which is usually unobtainable by any practical algorithm
in polynomial time. For example, [9] showed that there exists a local
solution that possesses an oracle property; [15] and [11] showed that the
oracle estimator is a local minimizer with high probability. Later on, [16]
and [30] proved that the global optimum achieves the oracle property under
certain conditions. Nevertheless, none of these papers specifies an algorithm
to find the desired solution. More recently, [20, 22, 1] developed a projected
gradient algorithm with desired statistical guarantees. However, they need
to modify the estimating procedures to include an additional $\ell_1$-ball constraint,
$\|\beta\|_1 \le R$, which depends on the unknown true parameter. On the
other hand, practitioners have developed numerous heuristic algorithms for
nonconvex optimization problems, but without theoretical guarantees. One
such example is the coordinate optimization strategy studied in [6] and [13].

So there is a gap between theory and practice: What is actually computed 
is not the same as what has been proved. To bridge this gap, we propose 
an iterative local adaptive majorize-minimization (I-LAMM) algorithm for
fitting high dimensional statistical models. Unlike most existing methods,
which are mainly motivated from a statistical perspective and ignore the
computational consideration, I-LAMM is both algorithmic and statistical: it
computes an estimator within polynomial time and achieves optimal statistical
accuracy for this estimator. In particular, I-LAMM obtains estimators
with the strongest statistical guarantees for a wide family of loss functions 
under the weakest possible assumptions. Moreover, the statistical properties 
are established for the estimators computed exactly by our algorithm, which 
is designed to control the cost of computing resources. Compared to existing 
works [20, 22, 1], our method does not impose any constraint that depends 
on the unknown true parameter. 

Inspired by the local linear approximation to the folded concave penalty
[34], we use I-LAMM to solve a sequence of convex programs up to a prefixed
optimization precision

(1.2) $\min_{\beta \in \mathbb{R}^d} \{ \mathcal{L}(\beta) + \mathcal{R}(\lambda^{(\ell-1)} \odot \beta) \}$, for $\ell = 1, \ldots, T$,

where $\lambda^{(\ell-1)} = (\lambda w(|\tilde\beta_1^{(\ell-1)}|), \ldots, \lambda w(|\tilde\beta_d^{(\ell-1)}|))^T$, $\tilde\beta^{(\ell)}$ is an approximate
solution to the $\ell$th optimization problem in (1.2), $w(\cdot)$ is a weighting
function, $\mathcal{R}(\cdot)$ is a decomposable convex penalty function, and '$\odot$' denotes the
Hadamard product. In this paper, we mainly consider $\mathcal{R}(\beta) = \|\beta\|_1$, though
our theory is general. The weighting function corresponds to the derivative
of the folded concave penalty in [9], [34] and [11].

In particular, the I-LAMM algorithm obtains a crude initial estimator $\tilde\beta^{(1)}$
and further solves the optimization problem (1.2) for $\ell \ge 2$ with established
algorithmic and statistical properties. This provides theoretical insights on
how fast the algorithm converges and how much computation is needed,
as well as the desired statistical properties of the obtained estimator. The
whole procedure consists of $T$ convex programs, each of which only needs to be
solved approximately to control the computational cost. Under mild conditions, we
show that only $\log(\lambda\sqrt{n})$ steps are needed to obtain the optimal statistical
rate of convergence. Even though I-LAMM solves approximately a sequence
of convex programs, the solution enjoys the same optimal statistical property
as the unobtainable global optimum for the folded-concave penalized regression.
The adaptive stopping rule for solving each convex program in (1.2)
allows us to control both computational costs and statistical errors. Figure
1 provides a geometric illustration of the I-LAMM procedure. It contains a
contraction stage and a tightening stage as described below.

Fig 1. Geometric illustration of the contraction property. The contraction stage produces
an initial estimator, starting from any initial value, that falls in the contraction region,
which secures the tightening stage to enjoy optimal statistical and computational rates of
convergence. The tightening stage adaptively refines the contraction estimator till it enters
the optimal region, which is stated in (1.3). Here $\lambda$ is a regularization parameter, $s$ the
number of nonzero coefficients in $\beta^*$ and $n$ the sample size.

* Contraction Stage: In this stage ($\ell = 1$), we approximately solve a convex
optimization problem (1.2), starting from any initial value $\tilde\beta^{(0)}$, and
terminate the algorithm as long as the approximate solution enters a desired
contraction region which will be characterized in Section 2.3. The
obtained estimator is called the contraction estimator, which is very crude
and only serves as initialization.

* Tightening Stage: This stage involves multiple tightening steps ($\ell \ge 2$).
Specifically, we iteratively tighten the contraction estimator by solving
a sequence of convex programs. Each step contracts its initial estimator
towards the true parameter until it reaches the optimal region of convergence.
At that region, further iteration does not improve statistical
performance. See Figure 1. More precisely, we will show the following
contraction property

(1.3) $\|\tilde\beta^{(\ell)} - \beta^*\|_2 \lesssim \sqrt{s/n} + \delta \cdot \|\tilde\beta^{(\ell-1)} - \beta^*\|_2$ for $\ell \ge 2$,

where $\beta^*$ is the true regression coefficient, $\delta \in (0,1)$ a prefixed contraction
parameter and $\sqrt{s/n}$ the order of statistical error. Tightening helps
improve the accuracy only when $\|\tilde\beta^{(\ell-1)} - \beta^*\|_2$ dominates the statistical
error. The iteration effect is clearly demonstrated. Since $\tilde\beta^{(\ell-1)}$ is only used
to create an adaptive weight for $\tilde\beta^{(\ell)}$, we can control the iteration complexity
by solving each subproblem in (1.2) approximately. What differs
from the contraction stage is that the initial estimators in the tightening
stage are already in the contraction region, making the optimization algorithm
enjoy a geometric rate of convergence. This allows us to rapidly solve
(1.2) with small optimization error.

* (Phase Transition in Algorithmic Convergence) In the contraction stage ($\ell = 1$),
the optimization problem is not strongly convex and therefore our
algorithm has only a sublinear convergence rate. Once the solution enters
the contraction region, we will show that the feasible solutions are sparse
and the objective function is essentially 'low'-dimensional and becomes
(restricted) strongly convex and smooth in that region. Therefore, our
algorithm has a linear convergence rate for $\ell > 1$. Indeed, this holds even
for $\ell = 1$, which admits a sublinear rate until it enters the contraction
region and enjoys a linear rate of convergence after that. See Figure 2. But
this estimator (for $\ell = 1$) is the estimator that corresponds to the Lasso
penalty, not the folded concave penalty that we are looking for.
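
The logarithmic number of tightening steps can be read off from (1.3) by unrolling the recursion. The following is a sketch under stated assumptions: the hidden constant in $\lesssim$ is written as a hypothetical $C > 0$, $e_\ell$ is shorthand introduced here for $\|\tilde\beta^{(\ell)} - \beta^*\|_2$, and the contraction estimator is assumed to satisfy $e_1 \lesssim \lambda\sqrt{s}$, as in the contraction region of Section 2.3.

```latex
% Assumption: e_\ell \le C\sqrt{s/n} + \delta\, e_{\ell-1} for \ell \ge 2, with \delta \in (0,1).
\begin{align*}
e_\ell &\le C\sqrt{s/n}\,\sum_{j=0}^{\ell-2}\delta^{j} + \delta^{\ell-1} e_1
        \le \frac{C}{1-\delta}\sqrt{s/n} + \delta^{\ell-1} e_1 . \\
\intertext{With $e_1 \lesssim \lambda\sqrt{s}$, the second term is of order $\sqrt{s/n}$ as soon as}
\delta^{\ell-1}\lambda\sqrt{s} &\le \sqrt{s/n}
\quad\Longleftrightarrow\quad
\ell - 1 \ge \frac{\log(\lambda\sqrt{n})}{\log(1/\delta)} .
\end{align*}
```

Under these assumptions, of order $\log(\lambda\sqrt{n})$ tightening steps suffice, after which the first term dominates and further steps no longer improve the bound.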


[Figure 2 plot. Panel title: Computational Rate for Constant Correlation Design.]

Fig 2. Computational rate of convergence in each stage for the simulation experiment specified
in case % in Example 6.1. The x-axis is the iteration count $k$ within the $\ell$th subproblem.
The phase transition from sublinear rate to linear rate of algorithmic convergence is clearly
seen once the iterations enter the contraction region. Here $\hat\beta^{(\ell)}$ is the global minimizer
of the $\ell$th optimization problem in (1.2) and $\beta^{(\ell,k)}$ is its $k$th iteration (see Figure 3). For
$\ell = 1$, the initial estimation sequence has a sublinear rate and once the solution sequence enters
the contraction region, it becomes linearly convergent. For $\ell \ge 2$, the algorithm achieves
a linear rate, since all estimators are in the contraction region.

This paper makes four major contributions. First, I-LAMM offers an algorithmic
approach to obtain the optimal estimator with controlled computing
resources. Second, compared to the existing literature, our method requires
weaker conditions due to a novel localized analysis of sparse learning problems.
Specifically, our method does not need the extra ball constraint as
in [20] and [28], which is an artifact of their proofs. Third, our computational
framework takes the approximate optimization error into analysis
and provides theoretical guarantees for the estimator that is computed by
the algorithm. Fourth, our method provides new theoretical insights about
the adaptive Lasso and folded-concave penalized regression. In particular,
we bridge these two methodologies together using a unified framework. See
Section 3.2 for more details.

The rest of this paper proceeds as follows. In Section 2, we introduce
I-LAMM and its implementation. Section 3 is devoted to new insights
into existing methods for high dimensional regression. In Section 4, we introduce
both the localized sparse eigenvalue and localized restricted eigenvalue
conditions. Statistical properties and computational complexity are then
presented. In Section 5, we outline the key proof strategies. Numerical
simulations are provided to evaluate the proposed method in Section 6. We
conclude with a discussion in Section 7. All the proofs are postponed to the
online supplement.

Notation: For $u = (u_1, u_2, \ldots, u_d)^T \in \mathbb{R}^d$, we define the $\ell_q$-norm of $u$ by
$\|u\|_q = (\sum_{j=1}^d |u_j|^q)^{1/q}$, where $q \in [1, \infty)$. Let $\|u\|_{\min} = \min\{u_j : 1 \le
j \le d\}$. For a set $\mathcal{S}$, let $|\mathcal{S}|$ denote its cardinality. We define the $\ell_0$-pseudo
norm of $u$ as $\|u\|_0 = |\mathrm{supp}(u)|$, where $\mathrm{supp}(u) = \{j : u_j \ne 0\}$. For an
index set $\mathcal{I} \subseteq \{1, \ldots, d\}$, $u_{\mathcal{I}} \in \mathbb{R}^d$ is defined to be the vector whose $i$-th
entry is equal to $u_i$ if $i \in \mathcal{I}$ and zero otherwise. Let $A = [a_{i,j}] \in \mathbb{R}^{d \times d}$. For
$q \ge 1$, we define $\|A\|_q$ as the matrix operator $q$-norm of $A$. For index sets
$\mathcal{I}, \mathcal{J} \subseteq \{1, \ldots, d\}$, we define $A_{\mathcal{I},\mathcal{J}} \in \mathbb{R}^{d \times d}$ to be the matrix whose $(i,j)$th
entry is equal to $a_{i,j}$ if $i \in \mathcal{I}$ and $j \in \mathcal{J}$, and zero otherwise. We use $\mathrm{sign}(x)$
to denote the sign of $x$: $\mathrm{sign}(x) = x/|x|$ if $x \ne 0$ and $\mathrm{sign}(x) = 0$ otherwise.
For two functionals $f(n, d, s)$ and $g(n, d, s)$, we denote $f(n, d, s) \gtrsim g(n, d, s)$
if $f(n, d, s) \ge C g(n, d, s)$ for a constant $C$; $f(n, d, s) \lesssim g(n, d, s)$ otherwise.

2. Methodology. In this paper, we assume that the loss function $\mathcal{L}(\cdot) \in
\mathcal{F}_{\mathcal{L}}$, a family of general convex loss functions specified in Appendix A.

2.1. Local Adaptive Majorize-Minimization. Recall that the estimators
are obtained by solving a sequence of convex programs in (1.2). We require
the function $w(\cdot)$ used therein to be taken from the tightening function class
$\mathcal{T}$, defined as

(2.1) $\mathcal{T} = \{ w(\cdot) \in \mathbb{R} : w(t_1) \le w(t_2)$ for all $t_1 \ge t_2 \ge 0$,
      $0 \le w(t) \le 1$ if $t \ge 0$, $w(t) = 0$ if $t < 0 \}$.

To fix ideas, we take $\mathcal{R}_\lambda(\beta)$ in (1.1) to be $\sum_{j=1}^d p_\lambda(|\beta_j|)$, where $p_\lambda(\cdot)$ is a
folded concave penalty [9] such as the SCAD or MCP. As discussed in [9],
the penalized likelihood function in (1.1) is folded concave with respect to
$\beta$, making it difficult to optimize. We propose to use the adaptive local
linear approximation (adaptive LLA) to the penalty function [34, 12] and
approximately solve

(2.2) $\operatorname{argmin}_{\beta} \{ \mathcal{L}(\beta) + \sum_{j=1}^d p'_\lambda(|\tilde\beta_j^{(\ell-1)}|)\,|\beta_j| \}$, for $1 \le \ell \le T$,

where $\tilde\beta_j^{(\ell-1)}$ is the $j$th component of $\tilde\beta^{(\ell-1)}$ and $\tilde\beta^{(0)}$ can be an arbitrarily bad
initial value: $\tilde\beta^{(0)} = 0$, for example. If we assume that $w(\cdot) = \lambda^{-1} p'_\lambda(\cdot) \in
\mathcal{T}$, such as the SCAD or MCP, then the adaptive LLA algorithm can be
regarded as a special case of our general formulation (1.2). Note that the
LLA algorithm with the $\ell_q$-penalty ($q < 1$) is not covered by our algorithm since
its derivative is unbounded at the origin and thus $\lambda^{-1} p'_\lambda(\cdot) \notin \mathcal{T}$. The latter
creates a zero-absorbing state: once a component is shrunk to zero, it will
remain zero throughout the remaining iterations, as noted in [10]. Of course,
we can truncate the derivative of the penalty function to resolve this issue.
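
To make the tightening function class concrete, here is a minimal sketch of the weights $w(t) = \lambda^{-1} p'_\lambda(t)$ for SCAD and MCP, using the standard closed forms of their derivatives. The function names and the default shape parameters (`a=3.7`, `gamma=3.0`) are illustrative assumptions, not values fixed by this paper.

```python
import numpy as np

def scad_weight(t, lam, a=3.7):
    """w(t) = p'_lam(t) / lam for SCAD, evaluated at t >= 0 (a > 2 assumed)."""
    t = np.asarray(t, dtype=float)
    tail = np.clip(a * lam - t, 0.0, None) / ((a - 1.0) * lam)
    return np.where(t <= lam, 1.0, tail)

def mcp_weight(t, lam, gamma=3.0):
    """w(t) = p'_lam(t) / lam for MCP, evaluated at t >= 0 (gamma > 1 assumed)."""
    t = np.asarray(t, dtype=float)
    return np.clip(1.0 - t / (gamma * lam), 0.0, None)

# Adaptive weight vector built from a previous (hypothetical) estimate.
lam = 0.5
beta_prev = np.array([0.0, 0.3, 1.0, 5.0])
lam_vec = lam * scad_weight(np.abs(beta_prev), lam)
print(lam_vec)  # small coefficients keep the full penalty, large ones are unpenalized
```

Both weights are nonincreasing on $[0, \infty)$ and bounded between 0 and 1, which is what membership in $\mathcal{T}$ requires; a weight such as $1/t$ would fail the boundedness requirement.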

We now propose a local adaptive majorize-minimization (LAMM) principle,
which will be repeatedly called to practically solve the optimization
problem (2.2). We first review the majorize-minimization (MM) algorithm.
To minimize a general function $f(\beta)$, at a given point $\beta^{(k)}$, MM majorizes
it by $g(\beta \mid \beta^{(k)})$, which satisfies

$g(\beta \mid \beta^{(k)}) \ge f(\beta)$ and $g(\beta^{(k)} \mid \beta^{(k)}) = f(\beta^{(k)})$,

and then computes $\beta^{(k+1)} = \operatorname{argmin}_{\beta} \{ g(\beta \mid \beta^{(k)}) \}$ [17, 14]. The objective
value of such an algorithm is non-increasing in each step, since

(2.3) $f(\beta^{(k+1)}) \overset{\text{major.}}{\le} g(\beta^{(k+1)} \mid \beta^{(k)}) \overset{\text{min.}}{\le} g(\beta^{(k)} \mid \beta^{(k)}) \overset{\text{init.}}{=} f(\beta^{(k)})$.

An inspection of the above arguments shows that the global majorization
requirement is not necessary. It requires only the local property

(2.4) $f(\beta^{(k+1)}) \le g(\beta^{(k+1)} \mid \beta^{(k)})$ and $g(\beta^{(k)} \mid \beta^{(k)}) = f(\beta^{(k)})$

for the inequalities in (2.3) to hold.

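Inspired by the above observation, we locally majorize (2.2) at the $\ell$th
step. It is similar to the iteration steps used in the (proximal) gradient
method [5, 23]. Instead of computing and storing a large Hessian matrix
as in [34], we majorize $\mathcal{L}(\beta)$ in (2.2) at $\tilde\beta^{(\ell-1)}$ by an isotropic quadratic
function

$\mathcal{L}(\tilde\beta^{(\ell-1)}) + \langle \nabla\mathcal{L}(\tilde\beta^{(\ell-1)}), \beta - \tilde\beta^{(\ell-1)} \rangle + \frac{\phi}{2}\|\beta - \tilde\beta^{(\ell-1)}\|_2^2$,

where $\nabla$ is used to denote derivative. By Taylor's expansion, it suffices to
take $\phi$ that is no smaller than the largest eigenvalue of $\nabla^2\mathcal{L}(\tilde\beta^{(\ell-1)})$. More
importantly, the isotropic form also allows a simple analytic solution to the
subsequent majorized optimization problem:

(2.5) $\operatorname{argmin}_{\beta} \{ \mathcal{L}(\tilde\beta^{(\ell-1)}) + \langle \nabla\mathcal{L}(\tilde\beta^{(\ell-1)}), \beta - \tilde\beta^{(\ell-1)} \rangle
      + \frac{\phi}{2}\|\beta - \tilde\beta^{(\ell-1)}\|_2^2 + \sum_{j=1}^d p'_\lambda(|\tilde\beta_j^{(\ell-1)}|)\,|\beta_j| \}$.

With $\lambda^{(\ell-1)} = (p'_\lambda(|\tilde\beta_1^{(\ell-1)}|), \ldots, p'_\lambda(|\tilde\beta_d^{(\ell-1)}|))^T$, it is easy to show that (2.5)
is minimized at

$\beta^{(\ell,1)} = T_{\lambda^{(\ell-1)},\phi}(\tilde\beta^{(\ell-1)}) = S(\tilde\beta^{(\ell-1)} - \phi^{-1}\nabla\mathcal{L}(\tilde\beta^{(\ell-1)}), \phi^{-1}\lambda^{(\ell-1)})$,

where $S(x, \lambda)$ is the soft-thresholding operator, defined componentwise by
$S(x, \lambda) = (\mathrm{sign}(x_j) \cdot \max\{|x_j| - \lambda_j, 0\})_{j=1}^d$. The simplicity of this updating rule is due to the fact
that (2.5) is an unconstrained optimization problem. This is not the case in
[20] and [28].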
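
The closed-form update for (2.5) can be sketched as a gradient step followed by componentwise soft-thresholding. In this illustration the gradient is passed in as an array, and the names `soft_threshold` and `prox_step` are hypothetical helpers, not the paper's notation.

```python
import numpy as np

def soft_threshold(x, thresh):
    """Componentwise S(x, thresh) = sign(x_j) * max(|x_j| - thresh_j, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def prox_step(beta_prev, grad, lam_vec, phi):
    """Minimizer of the isotropic quadratic surrogate with weighted l1 penalty.

    beta_prev : current point at which the loss is majorized
    grad      : gradient of the loss at beta_prev
    lam_vec   : nonnegative adaptive penalty weights (one per coordinate)
    phi       : isotropic quadratic parameter (must be positive)
    """
    return soft_threshold(beta_prev - grad / phi, lam_vec / phi)

beta_prev = np.array([1.0, -0.2, 0.0])
grad = np.array([0.5, 0.1, -0.3])
lam_vec = np.full(3, 0.2)
beta_new = prox_step(beta_prev, grad, lam_vec, phi=2.0)
```

Each coordinate is handled independently, so the cost of one update is linear in $d$ beyond the gradient evaluation.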

However, finding the value of $\phi \ge \|\nabla^2\mathcal{L}(\tilde\beta^{(\ell-1)})\|_2$ is not an easy task in
computation. To avoid storing and computing the largest eigenvalue of a big
matrix, we now state the LAMM algorithm, thanks to the local requirement
(2.4). The basic idea of LAMM is to start from a very small isotropic parameter
$\phi_0$ and then successively inflate $\phi$ by a factor $\gamma_u > 1$ (say, 2). If the
solution satisfies (2.4), we stop this part of the algorithm, which will make
the target value non-increasing. Since after the $k$th iteration, $\phi = \gamma_u^{k-1}\phi_0$,
there always exists a $k$ such that it is no smaller than $\|\nabla^2\mathcal{L}(\tilde\beta^{(\ell-1)})\|_2$. In this
manner, the LAMM algorithm will find the smallest iteration that makes (2.4)
hold.

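Specifically, our proposed LAMM algorithm to solve (2.5) at $\tilde\beta^{(\ell-1)}$ begins
with $\phi = \phi_0$, say $10^{-6}$, iteratively increases $\phi$ by a factor of $\gamma_u > 1$ inside the
$\ell$th step of optimization, and computes

$\beta^{(\ell,1)} = T_{\lambda^{(\ell-1)},\phi^{(\ell,k)}}(\beta^{(\ell,0)})$ with $\phi^{(\ell,k)} = \gamma_u^{k-1}\phi_0$, $\beta^{(\ell,0)} = \tilde\beta^{(\ell-1)}$,

until the local property (2.4) holds. Here $T_{\lambda,\phi}(\cdot)$ denotes the soft-thresholding
update that minimizes (2.5). In our context, LAMM stops when

$\Psi_{\lambda^{(\ell-1)},\phi^{(\ell,k)}}(\beta^{(\ell,1)}, \beta^{(\ell,0)}) \ge F(\beta^{(\ell,1)}, \lambda^{(\ell-1)})$,

where $F(\beta, \lambda^{(\ell-1)}) = \mathcal{L}(\beta) + \sum_{j=1}^d \lambda_j^{(\ell-1)}|\beta_j|$ and

$\Psi_{\lambda^{(\ell-1)},\phi^{(\ell,k)}}(\beta, \beta^{(\ell,0)}) = \mathcal{L}(\beta^{(\ell,0)}) + \langle \nabla\mathcal{L}(\beta^{(\ell,0)}), \beta - \beta^{(\ell,0)} \rangle
      + \frac{\phi^{(\ell,k)}}{2}\|\beta - \beta^{(\ell,0)}\|_2^2 + \sum_{j=1}^d \lambda_j^{(\ell-1)}|\beta_j|$.

Inspired by [23], to accelerate LAMM within the next majorizing step, we
keep track of the sequence $\{\phi^{(\ell,k)}\}_{\ell,k}$ and set $\phi^{(\ell,k)} = \max\{\phi_0, \gamma_u^{-1}\phi^{(\ell,k-1)}\}$,
with the convention that $\phi^{(\ell,0)} = \phi_{\ell-1}$ (and the initial value $\phi_0$ when $\ell = 1$), in which $\phi_{\ell-1}$ is the
isotropic parameter corresponding to the solution $\tilde\beta^{(\ell-1)}$. This is summarized
in Algorithm 1 with a generic initial value.

Algorithm 1 The LAMM algorithm in the $k$th iteration of the $\ell$th tightening
subproblem.

1: Algorithm: $\{\beta^{(\ell,k)}, \phi^{(\ell,k)}\} \leftarrow$ LAMM$(\lambda^{(\ell-1)}, \beta^{(\ell,k-1)}, \phi_0, \phi^{(\ell,k-1)})$
2: Input: $\lambda^{(\ell-1)}$, $\beta^{(\ell,k-1)}$, $\phi_0$, $\phi^{(\ell,k-1)}$
3: Initialize: $\phi^{(\ell,k)} \leftarrow \max\{\phi_0, \gamma_u^{-1}\phi^{(\ell,k-1)}\}$
4: Repeat
5:   $\beta^{(\ell,k)} \leftarrow T_{\lambda^{(\ell-1)},\phi^{(\ell,k)}}(\beta^{(\ell,k-1)})$
6:   If $F(\beta^{(\ell,k)}, \lambda^{(\ell-1)}) > \Psi_{\lambda^{(\ell-1)},\phi^{(\ell,k)}}(\beta^{(\ell,k)}; \beta^{(\ell,k-1)})$ then $\phi^{(\ell,k)} \leftarrow \gamma_u \phi^{(\ell,k)}$
7: Until $F(\beta^{(\ell,k)}, \lambda^{(\ell-1)}) \le \Psi_{\lambda^{(\ell-1)},\phi^{(\ell,k)}}(\beta^{(\ell,k)}; \beta^{(\ell,k-1)})$
8: Return $\{\beta^{(\ell,k)}, \phi^{(\ell,k)}\}$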

The LAMM algorithm solves only one local majorization step. It corresponds
to moving one horizontal step in Figure 3. To solve (2.2), we need to
use LAMM iteratively, which we shall call the iterative LAMM (I-LAMM)
algorithm, and compute a sequence of solutions $\beta^{(\ell,k)}$ using the initial value
$\beta^{(\ell,k-1)}$. Figure 3 depicts the schematics of our algorithm: the $\ell$th row corresponds
to solving the $\ell$th subproblem in (2.2) approximately, beginning
by computing the adaptive weight $\lambda^{(\ell-1)}$. The number of iterations needed
within each row will be discussed in the sequel.
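
A single LAMM step can be sketched as a backtracking search on the isotropic parameter. This is a minimal illustration, assuming the least-squares loss $\mathcal{L}(\beta) = \|y - X\beta\|_2^2/(2n)$; the function names, the default `phi0` and `gamma_u`, and the small numerical tolerance are assumptions of this sketch rather than the authors' implementation.

```python
import numpy as np

def soft_threshold(x, thresh):
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def loss_and_grad(beta, X, y):
    """Least-squares loss ||y - X beta||^2 / (2n) and its gradient (assumed loss)."""
    n = X.shape[0]
    resid = X @ beta - y
    return 0.5 * resid @ resid / n, X.T @ resid / n

def lamm_step(beta_prev, lam_vec, X, y, phi_prev, phi0=1e-6, gamma_u=2.0):
    """One LAMM step: inflate phi until the local property F <= Psi holds."""
    loss_prev, grad_prev = loss_and_grad(beta_prev, X, y)
    phi = max(phi0, phi_prev / gamma_u)  # warm start from the previous phi
    while True:
        beta_new = soft_threshold(beta_prev - grad_prev / phi, lam_vec / phi)
        diff = beta_new - beta_prev
        # The weighted l1 term appears on both sides of F <= Psi, so it cancels
        # and only the smooth parts need to be compared.
        surrogate = loss_prev + grad_prev @ diff + 0.5 * phi * diff @ diff
        if loss_and_grad(beta_new, X, y)[0] <= surrogate + 1e-12:
            return beta_new, phi
        phi *= gamma_u  # surrogate too optimistic: inflate and retry

def penalized_objective(beta, lam_vec, X, y):
    """F(beta, lam_vec) = L(beta) + sum_j lam_j |beta_j|."""
    return loss_and_grad(beta, X, y)[0] + np.sum(lam_vec * np.abs(beta))
```

Calling `lamm_step` repeatedly, feeding each returned pair back in, traces one row of the scheme; by the argument around (2.3) and (2.4), `penalized_objective` should not increase from one call to the next (up to the tolerance). The search terminates because the comparison holds once `phi` exceeds the largest eigenvalue of $X^T X/n$.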

2.2. Stopping Criterion. I-LAMM recognizes that the exact solutions to
(2.2) can never be achieved in practice with algorithmic complexity control.
Instead, in the $\ell$th optimization subproblem, we compute the approximate
solution, $\tilde\beta^{(\ell)}$, up to an optimization error $\varepsilon$, the choice of which will be
discussed in the next subsection. To calculate this approximate solution, starting
from the initial value $\beta^{(\ell,0)} = \tilde\beta^{(\ell-1)}$, the algorithm constructs a solution
sequence $\{\beta^{(\ell,k)}\}_{k=1,2,\ldots}$ using the introduced LAMM algorithm. See Figure 3.

We then introduce a stopping criterion for the I-LAMM algorithm. From
optimization theory (Section 5.5 in [5]), we know that any exact solution $\hat\beta^{(\ell)}$
to the $\ell$th subproblem in (2.2) satisfies the first order optimality condition:

(2.6) $\nabla\mathcal{L}(\hat\beta^{(\ell)}) + \lambda^{(\ell-1)} \odot \xi = 0$, for some $\xi \in \partial\|\hat\beta^{(\ell)}\|_1 \subseteq [-1, 1]^d$,

where $\partial$ is used to indicate the subgradient operator. The set of subgradients
of a function $f : \mathbb{R}^d \to \mathbb{R}$ at a point $x_0$, denoted as $\partial f(x_0)$, is defined as
the collection of vectors, $\xi$, such that $f(x) - f(x_0) \ge \xi^T(x - x_0)$, for any $x$.
Thus, a natural measure for suboptimality of $\beta$ can be defined as

$\omega_{\lambda^{(\ell-1)}}(\beta) = \min_{\xi \in \partial\|\beta\|_1} \{ \|\nabla\mathcal{L}(\beta) + \lambda^{(\ell-1)} \odot \xi\|_\infty \}$.

For a prefixed optimization error $\varepsilon$, we stop the algorithm within the $\ell$th
subproblem when $\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,k)}) \le \varepsilon$. We call $\tilde\beta^{(\ell)} = \beta^{(\ell,k)}$ an $\varepsilon$-optimal
solution. More details can be found in Algorithm 2.
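
The suboptimality measure has a closed form because the minimization over subgradients separates across coordinates. The sketch below assumes the gradient is supplied as an array and that the weights are nonnegative; `suboptimality` is a hypothetical helper name.

```python
import numpy as np

def suboptimality(beta, grad, lam_vec):
    """min over xi in the subdifferential of ||beta||_1 of ||grad + lam_vec * xi||_inf."""
    # On the support, xi_j = sign(beta_j) is forced.
    on_support = np.abs(grad + lam_vec * np.sign(beta))
    # Off the support, xi_j ranges over [-1, 1]; the best choice
    # shrinks |grad_j| by lam_j, down to zero.
    off_support = np.maximum(np.abs(grad) - lam_vec, 0.0)
    return np.max(np.where(beta != 0, on_support, off_support))

beta = np.array([1.0, 0.0, -2.0])
grad = np.array([-0.2, 0.15, 0.25])
lam_vec = np.full(3, 0.2)
omega = suboptimality(beta, grad, lam_vec)
```

The value is zero exactly when the first order condition (2.6) holds at `beta`, so comparing it with $\varepsilon$ gives a stopping rule that does not require knowing the exact minimizer.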


Algorithm 2 I-LAMM algorithm for each subproblem in (2.2).

1: Algorithm: $\{\tilde\beta^{(\ell)}\} \leftarrow$ I-LAMM$(\lambda^{(\ell-1)}, \beta^{(\ell,0)})$
2: Input: $\phi_0 > 0$
3: for $k = 0, 1, \cdots$ until $\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,k)}) \le \varepsilon$ do
4:   $\{\beta^{(\ell,k)}, \phi^{(\ell,k)}\} \leftarrow$ LAMM$(\lambda^{(\ell-1)}, \beta^{(\ell,k-1)}, \phi_0)$
5: end for
6: Output: $\tilde\beta^{(\ell)} = \beta^{(\ell,k)}$


$\lambda^{(0)}$:    $\beta^{(1,0)} = 0$  -> LAMM ->  $\beta^{(1,1)}$  -> LAMM ->  ...  $\beta^{(1,k_1)} = \tilde\beta^{(1)}$,    $k_1 \lesssim \varepsilon_c^{-2}$;
$\lambda^{(1)}$:    $\beta^{(2,0)} = \tilde\beta^{(1)}$  -> LAMM ->  $\beta^{(2,1)}$  -> LAMM ->  ...  $\beta^{(2,k_2)} = \tilde\beta^{(2)}$,    $k_2 \lesssim \log(\varepsilon_t^{-1})$;
...
$\lambda^{(T-1)}$:  $\beta^{(T,0)} = \tilde\beta^{(T-1)}$  -> LAMM ->  $\beta^{(T,1)}$  -> LAMM ->  ...  $\beta^{(T,k_T)} = \tilde\beta^{(T)}$,    $k_T \lesssim \log(\varepsilon_t^{-1})$.

Fig 3. Paradigm illustration of I-LAMM. $k_\ell$, $1 \le \ell \le T$, is the iteration index for the
$\ell$th optimization in (2.2). $\varepsilon_c$ and $\varepsilon_t$ are the precision parameters for the contraction and
tightening stage respectively and will be described in Section 2.3 in detail.

Remark 2.1. The I-LAMM algorithm is an early-stop variant of the ISTA
algorithm to handle general loss functions and nonconvex penalties [2]. The
LAMM principle serves as a novel perspective for the proximal gradient
method.

2.3. Tightening After Contraction. From the computational perspective,
optimization in (2.2) can be categorized into two stages: contraction ($\ell =
1$) and tightening ($2 \le \ell \le T$). In the contraction stage, we start from an
arbitrary initial value, which can be quite remote from the underlying true
parameter. We take $\varepsilon$ as $\varepsilon_c \asymp \lambda$, reflecting the precision needed to bring
the initial solution to a contracting neighborhood of the global minimum.
For instance, in the linear model with sub-Gaussian errors, $\varepsilon_c$ can be taken
in the order of $\sqrt{\log d/n}$. This stage aims to find a good initial estimator
$\tilde\beta^{(1)}$ for the subsequent optimization subproblems in the tightening stage.
Recall that $s = \|\beta^*\|_0$ is the sparsity level. We will show in Section 4.3 that
with a properly chosen $\lambda$, the approximate solution $\tilde\beta^{(1)}$, produced by the
early stopped I-LAMM algorithm, falls in the region of such good initial
estimators

$\{\beta : \|\beta - \beta^*\|_2 \le C\lambda\sqrt{s}$ and $\beta$ is sparse$\}$.

We call this region the contraction region.

However, the estimator $\tilde\beta^{(1)}$ suffers from a suboptimal statistical rate of
convergence which is inferior to the refined one obtained by nonconvex
regularization. A second stage to tighten this coarse contraction estimator into
the optimal region of convergence is needed. This is achieved by the subsequent
optimization ($\ell \ge 2$) and referred to as a tightening stage. Because the
initial estimators are already good and sparse at each iteration of the tightening
stage, the I-LAMM algorithm at this stage enjoys a geometric rate of
convergence, due to the sparse strong convexity. Therefore, the optimization
error $\varepsilon = \varepsilon_t$ can be much smaller to simultaneously ensure statistical accuracy
and control computational complexity. To achieve the oracle rate $\sqrt{s/n}$, $\varepsilon_t$
must be no larger than the order of $\sqrt{1/n}$. A graphical illustration of the full
algorithm is presented in Figure 3. Theoretical justifications are provided in
Section 4. From this perspective, we shall also call the pseudo-algorithm
in (1.2) or (2.2), combined with LAMM, the tightening after contraction
(TAC) algorithm.
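
The two-stage scheme can be sketched end to end as follows. This is a simplified illustration under several assumptions that are not from the paper: a least-squares loss, SCAD weights with `a=3.7`, a fixed isotropic parameter equal to the largest eigenvalue of $X^T X/n$ in place of the LAMM search, and hypothetical names (`tac`, `solve_subproblem`, `omega`) and defaults throughout.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def scad_weight(t, lam, a=3.7):
    """w(t) = p'_lam(t) / lam for SCAD at t >= 0."""
    return np.where(t <= lam, 1.0, np.clip(a * lam - t, 0.0, None) / ((a - 1.0) * lam))

def omega(beta, grad, lam_vec):
    """Suboptimality measure: distance from the first order condition."""
    on = np.abs(grad + lam_vec * np.sign(beta))
    off = np.maximum(np.abs(grad) - lam_vec, 0.0)
    return np.max(np.where(beta != 0, on, off))

def solve_subproblem(beta, lam_vec, X, y, eps, max_iter=10_000):
    """Proximal gradient steps until omega <= eps (least-squares loss assumed)."""
    n = X.shape[0]
    # Fixed phi: a simplification of this sketch, not the LAMM line search.
    phi = np.linalg.eigvalsh(X.T @ X / n)[-1]
    for _ in range(max_iter):
        grad = X.T @ (X @ beta - y) / n
        if omega(beta, grad, lam_vec) <= eps:
            break
        beta = soft_threshold(beta - grad / phi, lam_vec / phi)
    return beta

def tac(X, y, lam, eps_c, eps_t, T=5):
    """Contraction (ell = 1, tolerance eps_c), then tightening (ell >= 2, eps_t)."""
    beta = np.zeros(X.shape[1])
    for ell in range(1, T + 1):
        lam_vec = lam * scad_weight(np.abs(beta), lam)  # weights from previous stage
        eps = eps_c if ell == 1 else eps_t              # crude first, tight afterwards
        beta = solve_subproblem(beta, lam_vec, X, y, eps)
    return beta
```

Starting from zero, the first pass uses constant weights and therefore amounts to an early-stopped Lasso; later passes reuse the same solver with weights that shrink on coordinates the previous estimate found to be large. A call might look like `tac(X, y, lam, eps_c=lam / 4, eps_t=1e-4)`, where the specific tolerances are illustrative choices only.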

3. New Insights into Existing Methods. 

3.1. Connection to One-step Local Linear Approximation. In the low
dimensional regime, [34] shows that the one-step LLA algorithm produces
an oracle estimator if the maximum likelihood estimator (MLE) is used for
initialization. They thus claim that the multi-step LLA is unnecessary. However,
this is not the case in high dimensions, under which an unbiased initial
estimator, such as the MLE, is not available. In this paper, we show that
starting from a possibly arbitrarily bad initial value (such as 0), the contraction
stage can produce a sparse coarse estimator. Each tightening step then
refines the estimator from the previous step to the optimal region of convergence
by

(3.1) $\|\tilde\beta^{(\ell)} - \beta^*\|_2 \lesssim \sqrt{s/n} + \delta \cdot \|\tilde\beta^{(\ell-1)} - \beta^*\|_2$, for $2 \le \ell \le T$,

where $\delta \in (0,1)$ is a prefixed contraction parameter. Unlike the one-step
method in [12], the role of iteration is clearly evidenced in (3.1).

An important aspect of our algorithm (2.2) is that we use the solvable
approximate solutions, $\tilde\beta^{(\ell)}$'s, rather than the exact ones, $\hat\beta^{(\ell)}$'s. In order to
practically implement (2.2) for a general convex loss function, [34] propose
to locally approximate $\mathcal{L}(\beta)$ by a quadratic function:

(3.2) $\mathcal{L}(\beta^{(0)}) + \langle \nabla\mathcal{L}(\beta^{(0)}), \beta - \beta^{(0)} \rangle + \frac{1}{2}(\beta - \beta^{(0)})^T \nabla^2\mathcal{L}(\beta^{(0)})(\beta - \beta^{(0)})$,

where $\beta^{(0)}$ is a 'good' initial estimator of $\beta^*$ and $\nabla^2\mathcal{L}(\beta^{(0)})$ is the Hessian
evaluated at $\beta^{(0)}$. However, in high dimensions, evaluating the $d \times d$ Hessian
is not only computationally intensive but also requires a large storage cost.
In addition, the optimization problem (2.2) cannot be solved analytically
with approximation (3.2). We resolve these issues by proposing the isotropic
quadratic approximation; see Section 2.

3.2. New Insight into Folded-concave Regularization and Adaptive Lasso.
The adaptive local linear approximation (2.2) provides new insight into
folded-concave regularization and adaptive Lasso. To correct the Lasso's
estimation bias, folded-concave regularization [9] and its one-step implementation,
adaptive Lasso [33, 34, 12], have drawn much research interest due
to their attractive statistical properties. For a general loss function $\mathcal{L}(\beta)$,
the adaptive Lasso solves

$\hat\beta_{\mathrm{adapt}} = \operatorname{argmin}_{\beta} \{ \mathcal{L}(\beta) + \lambda \sum_{j=1}^d w(\hat\beta_{\mathrm{init},j})\,|\beta_j| \}$,

where $\hat\beta_{\mathrm{init},j}$ is an initial estimator of $\beta_j$. We see that the adaptive Lasso is a
special case of (2.2) with $\ell = 2$. Two important open questions for adaptive
Lasso are to obtain a good enough initial estimator in high dimensions and
to select a suitable tuning parameter $\lambda$ which achieves the optimal statistical
performance. Our solution to the first question is to use the approximate
solution to Lasso with controlled computational complexity, which corresponds
to $\ell = 1$ in (2.2). For the choice of $\lambda$, [7] suggested sequential tuning:
in the first stage, they use cross validation to select the initial tuning parameter,
denoted here by $\lambda_{\mathrm{init,cv}}$, and the corresponding estimator $\hat\beta_{\mathrm{init}}$; in
the second stage, they again adopt cross validation to select the adaptive
tuning parameter $\lambda$ in the adaptive Lasso. Despite the popularity of such a
tuning procedure, there are no theoretical guarantees to support it. As will
be shown later in Theorem 4.2 and Corollary 4.3, our framework produces
an optimal solution by only tuning $\lambda^{(0)} = \lambda\mathbf{1}$ in the contraction stage, indicating
that sequential tuning may not be necessary for the adaptive Lasso if
$w(\cdot)$ is chosen from the tightening function class $\mathcal{T}$.

It is worth noting that a classical weight $w(\beta_j) = 1/|\beta_j|$ for the adaptive
Lasso does not belong to the tightening function class $\mathcal{T}$. As pointed out
by [10], zero is an absorbing state of the adaptive Lasso with this choice of
weight function. Hence, when the Lasso estimator in the first stage misses
any true positives, they will be missed forever in later stages as well. In contrast,
the proposed tightening function class $\mathcal{T}$ overcomes such shortcomings
by restricting the weight function $w(\cdot)$ to be bounded. This phenomenon is
further elaborated via our numerical experiments in Section 6. The mean
square error for the adaptive Lasso can be even worse than that of the Lasso
estimator because the adaptive Lasso may miss true positives in the strongly
correlated design case.

Our framework also reveals interesting connections between the adaptive
Lasso and folded-concave regularization. Specifically, consider the following
folded-concave penalized regression

(3.3) $\min_{\beta} \{ \mathcal{L}(\beta) + \mathcal{R}_\lambda(|\beta|) \}$, where $\mathcal{R}_\lambda(|\beta|)$ is a folded concave penalty.

We assume that $\mathcal{R}_\lambda(\cdot)$ is elementwise decomposable, that is, $\mathcal{R}_\lambda(|\beta|) =
\sum_{k=1}^d p_\lambda(|\beta_k|)$. Under this assumption, using the concave duality, we can
rewrite $\mathcal{R}_\lambda(|\beta|)$ as

(3.4) $\mathcal{R}_\lambda(|\beta|) = \inf_{v} \{ |\beta|^T v - \mathcal{R}^*_\lambda(v) \}$,

where $\mathcal{R}^*_\lambda(\cdot)$ is the dual of $\mathcal{R}_\lambda(\cdot)$. By the duality theory, we know that the
minimum of (3.4) is achieved at $\hat v = \nabla\mathcal{R}_\lambda(|u|)|_{u=\beta}$. We can employ (3.4)
to reformulate (3.3) as

$(\hat\beta, \hat v) = \operatorname{argmin}_{\beta, v} \{ \mathcal{L}(\beta) + v^T|\beta| - \mathcal{R}^*_\lambda(v) \}$.

The optimization above can then be solved by exploiting the alternating
minimization scheme. In particular, we repeatedly apply the following two
steps:

(1) Optimize over $\beta$ with $v$ fixed: $\hat\beta^{(\ell)} = \operatorname{argmin}_{\beta} \{ \mathcal{L}(\beta) + (\hat v^{(\ell-1)})^T|\beta| \}$.

(2) Optimize over $v$ with $\beta$ fixed. We can obtain a closed form solution:
$\hat v^{(\ell)} = \nabla\mathcal{R}_\lambda(|u|)|_{u=\hat\beta^{(\ell)}}$.

This is a special case of (1.2) if we take $w(\beta) = \lambda^{-1}\nabla\mathcal{R}_\lambda(|u|)|_{u=\beta}$ and let
$\ell$ grow until convergence. Therefore, with a properly chosen weight function
$w(\cdot)$, our proposed algorithm bridges the adaptive Lasso and folded-concave
penalized regression together under different choices of $\ell$. In Corollary 4.3,
we will prove that, when $\ell$ is in the order of $\log(\lambda\sqrt{n})$, the proposed
estimator enjoys the optimal statistical rate $\|\tilde\beta^{(\ell)} - \beta^*\|_2 \propto \sqrt{s/n}$, under
mild conditions.

4. Theoretical Results. We establish the optimal statistical rate of
convergence and the computational complexity of the proposed algorithm.
To establish these results in a general framework, we first introduce the
localized versions of the sparse eigenvalue and restricted eigenvalue conditions.

4.1. Localized Eigenvalues and Assumptions. The sparse eigenvalue condition
[30] is commonly used in the analysis of sparse learning problems.
However, it is only valid for the least squares loss. For a general loss function,
the Hessian matrix depends on the parameter $\beta$ and can become nearly
singular in certain regions. For example, the Hessian matrix of the logistic
loss is

$\nabla^2\mathcal{L}(\beta) = \frac{1}{n}\sum_{i=1}^n x_i x_i^T \cdot \frac{1}{1 + \exp(-x_i^T\beta)} \cdot \frac{1}{1 + \exp(x_i^T\beta)}$,

which tends to zero as $\|\beta\|_2 \to \infty$, no matter what the data are. One of our
key theoretical observations is that what we really need are the localized
conditions around the true parameter $\beta^*$, which we now introduce.
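
The decay of the logistic Hessian away from the origin can be checked numerically. The sketch below is an illustration on simulated Gaussian data (an assumption of this example); `logistic_hessian` is a hypothetical helper, and the weight $p(1-p)$ is computed in a numerically stable form.

```python
import numpy as np

def logistic_hessian(beta, X):
    """(1/n) * sum_i x_i x_i^T * p_i * (1 - p_i), with p_i = 1 / (1 + exp(-x_i^T beta))."""
    z = np.abs(X @ beta)
    # p(1 - p) = exp(-|z|) / (1 + exp(-|z|))^2, which avoids overflow for large |z|.
    w = np.exp(-z) / (1.0 + np.exp(-z)) ** 2
    return (X * w[:, None]).T @ X / X.shape[0]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
direction = np.ones(5) / np.sqrt(5.0)

# Largest Hessian eigenvalue as beta moves away from the origin along one direction.
for scale in [0.0, 1.0, 10.0, 100.0]:
    H = logistic_hessian(scale * direction, X)
    print(scale, np.linalg.eigvalsh(H)[-1])
```

Since each weight is at most 1/4, attained only at $x_i^T\beta = 0$, the Hessian at the origin dominates the Hessian at every other point, and the printed eigenvalues shrink as the scale grows. This is why a global eigenvalue bound is too much to ask for and a bound restricted to $\|\beta - \beta^*\|_2 \le r$ is the natural condition.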


4.1.1. Localized Sparse Eigenvalue. 
Definition 4.1 (Localized Sparse Eigenvalue, LSE). The localized sparse
eigenvalues are defined as

$$ \rho_+(m, r) = \sup_{u, \beta} \{ u_J^{T} \nabla^2 \mathcal{L}(\beta) u_J : \|u_J\|_2^2 = 1,\ |J| \le m,\ \|\beta - \beta^*\|_2 \le r \}; $$
$$ \rho_-(m, r) = \inf_{u, \beta} \{ u_J^{T} \nabla^2 \mathcal{L}(\beta) u_J : \|u_J\|_2^2 = 1,\ |J| \le m,\ \|\beta - \beta^*\|_2 \le r \}. $$

Both $\rho_+(m, r)$ and $\rho_-(m, r)$ depend on the Hessian matrix $\nabla^2 \mathcal{L}(\beta)$, the
true coefficient $\beta^*$, the sparsity level $m$, and an extra locality parameter $r$.
They reduce to the commonly used sparse eigenvalues when $\nabla^2 \mathcal{L}(\beta)$ does
not change with $\beta$, as in the quadratic loss. The following assumption
specifies the LSE condition in detail. Recall that $s = \|\beta^*\|_0$.

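Assumption 4.1. We say the LSE condition holds if there exist an integer
$\tilde{s} \ge cs$ for some constant $c$, $r$ and a constant $C$ such that

$$ 0 < \rho_* \le \rho_-(2s + 2\tilde{s}, r) < \rho_+(2s + 2\tilde{s}, r) \le \rho^* < +\infty \quad \text{and} \quad \rho_+(\tilde{s}, r) / \rho_-(2s + 2\tilde{s}, r) \le 1 + C\tilde{s}/s. $$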

Assumption 4.1 is standard for linear regression problems and is commonly
referred to as the sparse eigenvalue condition when $r = \infty$. Such
conditions have been employed by [4, 24, 22, 20, 28]. The newly proposed
LSE condition, to the best of our knowledge, is the weakest one in the
literature.

4.1.2. Localized Restricted Eigenvalue. In this section, we introduce the
localized version of the restricted eigenvalue condition [4]. This is an
alternative condition to Assumption 4.1 that allows us to handle general Hessian
matrices that depend on $\beta$, and under which the theoretical properties can be
developed in parallel.

Definition 4.2 (Localized Restricted Eigenvalue, LRE). The localized
restricted eigenvalue is defined as

$$ \kappa_+(m, \gamma, r) = \sup_{u, \beta} \{ u^{T} \nabla^2 \mathcal{L}(\beta) u : (u, \beta) \in \mathcal{C}(m, \gamma, r) \}; $$
$$ \kappa_-(m, \gamma, r) = \inf_{u, \beta} \{ u^{T} \nabla^2 \mathcal{L}(\beta) u : (u, \beta) \in \mathcal{C}(m, \gamma, r) \}, $$

where $\mathcal{C}(m, \gamma, r) = \{ u, \beta : S \subseteq J,\ |J| \le m,\ \|u_{J^c}\|_1 \le \gamma \|u_J\|_1,\ \|\beta - \beta^*\|_2 \le r \}$
is a local $\ell_1$ cone.

Similarly, the localized restricted eigenvalue reduces to the restricted
eigenvalue when $\nabla^2 \mathcal{L}(\beta)$ does not depend on $\beta$. We say the localized
restricted eigenvalue condition holds if there exist $m, \gamma, r$ such that
$0 < \kappa_-(m, \gamma, r) \le \kappa_+(m, \gamma, r) < \infty$. In Appendix B, we give a geometric
explanation of the local $\ell_1$ cone, $\mathcal{C}(m, \gamma, r)$, and the corresponding localized
analysis.

4.2. Statistical Theory. In this section, we provide theoretical analysis of
the proposed estimator under the LSE condition. For completeness, in
Appendix B, we also establish similar results under the localized restricted
eigenvalue condition. We begin with the contraction stage. Recall that the initial
value $\widetilde{\beta}^{(0)}$ is taken as $0$ for simplicity. We need the following assumption on
the tightening function.

Assumption 4.2. Assume that $w(\cdot) \in \mathcal{T}$ and $w(u) \ge 1/2$ for $u = 18\rho_*^{-1}\delta^{-1}\lambda$.
Here $\mathcal{T}$ is the tightening function class defined in (2.1).

Our first result characterizes the statistical convergence rate of the
estimator in the contraction stage. The key ideas of the proofs are outlined in
Section 5. Other technical lemmas and details can be found in the online
supplement.

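Proposition 4.1 (Statistical Rate in the Contraction Stage). Suppose that
Assumption 4.1 holds. If $\lambda$, $\varepsilon_c$ and $r$ satisfy

(4.1) $$ 4(\|\nabla \mathcal{L}(\beta^*)\|_\infty + \varepsilon_c) \le \lambda \le r\rho_*/(18\sqrt{s}), $$

then any $\varepsilon_c$-optimal solution $\widetilde{\beta}^{(1)}$ satisfies

$$ \|\widetilde{\beta}^{(1)} - \beta^*\|_2 \le 18\rho_*^{-1}\lambda\sqrt{s} \lesssim \lambda\sqrt{s}. $$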

The result above is a deterministic statement. Its proof is omitted as
it directly follows from Lemma 5.1 with $\ell = 1$ and $\mathcal{E}_1$ there taken to be $S$, the
support of the true parameter $\beta^*$. The proof of Lemma 5.1 can be found in
Appendix B. In Proposition 4.1, the approximation error $\varepsilon_c$ can be taken to
be of the order of $\lambda \asymp \sqrt{\log d/n}$ in the sub-Gaussian noise case. The contraction
stage ensures that the $\ell_2$ estimation error is proportional to $\lambda\sqrt{s}$, which is
identical to the optimal rate of convergence for the Lasso estimator [4, 31].
Our result can be regarded as a generalization of the usual Lasso analysis to
more general losses which satisfy the localized sparse eigenvalue condition.
We are ready to present the main theorem, which demonstrates the effects
of optimization error, shrinkage bias and tightening steps on the statistical
rate.
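Theorem 4.2 (Optimal Statistical Rate). Suppose Assumptions 4.1 and 4.2
hold. If $4(\|\nabla \mathcal{L}(\beta^*)\|_\infty + (\varepsilon_t \vee \varepsilon_c)) \le \lambda \lesssim r/\sqrt{s}$, then any $\varepsilon_t$-optimal solution
$\widetilde{\beta}^{(\ell)}$, $\ell \ge 2$, satisfies the following $\delta$-contraction property

$$ \|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C\big( \|\nabla \mathcal{L}(\beta^*)_S\|_2 + \varepsilon_t\sqrt{s} + \lambda\|w(|\beta^*_S| - u)\|_2 \big) + \delta\|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2, $$

where $C$ is a constant and $u = 18\rho_*^{-1}\delta^{-1}\lambda$. Consequently, there exists a
constant $C'$ such that

$$ \|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C'\big( \underbrace{\|\nabla \mathcal{L}(\beta^*)_S\|_2}_{\text{oracle rate}} + \underbrace{\varepsilon_t\sqrt{s}}_{\text{opt err}} + \underbrace{\lambda\|w(|\beta^*_S| - u)\|_2}_{\text{coefficient effect}} \big) + \underbrace{C'\delta^{\ell-1}\lambda\sqrt{s}}_{\text{tightening effect}}. $$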

The effect of the tightening stage can be clearly seen from the theorem
above: each tightening step induces a $\delta$-contraction property, which reduces
the influence of the estimation error from the previous step by a $\delta$-fraction.
Therefore, in order to achieve the oracle rate $\sqrt{s/n}$, we shall carefully choose
the optimization error such that $\varepsilon_t \lesssim \|\nabla \mathcal{L}(\beta^*)_S\|_2/\sqrt{s}$ and make the number of
tightening iterations $\ell$ large enough. As a corollary, we give the explicit statistical
rate under the quadratic loss $\mathcal{L}(\beta) = (2n)^{-1}\|y - X\beta\|_2^2$. In this case, we take
$\lambda \asymp \sqrt{n^{-1}\log d}$ so that the scaling condition (4.1) holds with high
probability. We use sub-Gaussian$(0, \sigma^2)$ to denote a sub-Gaussian
random variable with mean 0 and variance proxy $\sigma^2$.
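To see how quickly the tightening steps remove the contraction-stage error, the recursion "new error bound = base term + $\delta$ times previous error bound" can be unrolled numerically. This is only a sketch of the recursion; the numbers and the function name `contraction_bounds` are illustrative assumptions, not values from the paper.

```python
def contraction_bounds(base, delta, first_stage_error, num_steps):
    """Unroll e_l <= base + delta * e_{l-1}, starting from the stage-one bound e_1."""
    bounds = [first_stage_error]
    for _ in range(2, num_steps + 1):
        bounds.append(base + delta * bounds[-1])
    return bounds

# base plays the role of C * (oracle rate + optimization error + coefficient effect);
# first_stage_error plays the role of the lambda * sqrt(s) rate of the contraction stage.
bounds = contraction_bounds(base=0.1, delta=0.5, first_stage_error=2.0, num_steps=8)
print(bounds)
```

The bound decays geometrically toward `base / (1 - delta)`, so a logarithmic number of tightening steps is enough for the first-stage error to become negligible.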

Corollary 4.3. Let $y_i = x_i^{T}\beta^* + \epsilon_i$, $1 \le i \le n$, be independently and identically
distributed sub-Gaussian random variables with $\epsilon_i \sim$ sub-Gaussian$(0, \sigma^2)$.
The columns of $X$ are normalized such that $\max_j \|X_{*j}\|_2 \le \sqrt{n}$. Assume
there exists a $\gamma > 0$ such that $\|\beta^*_S\|_{\min} \ge u + \gamma\lambda$ and $w(\gamma\lambda) = 0$. Under
Assumptions 4.1 and 4.2, if $\lambda \asymp \sqrt{n^{-1}\log d}$, $\varepsilon_t \le \sqrt{1/n}$ and $T \gtrsim \log\log d$,
then with probability at least $1 - 2d^{-\eta_1} - 2\exp\{-\eta_2 s\}$, $\widetilde{\beta}^{(T)}$ must satisfy

$$ \|\widetilde{\beta}^{(T)} - \beta^*\|_2 \lesssim \sqrt{s/n}, $$

where $\eta_1$ and $\eta_2$ are positive constants.

Corollary 4.3 indicates that I-LAMM can achieve the oracle statistical rate
$\sqrt{s/n}$ as if the support of the true coefficients were known in advance. To
achieve such a rate, we require $\varepsilon_c \lesssim \sqrt{\log d/n}$ and $\varepsilon_t \lesssim \sqrt{1/n}$. In other words,
we need a more accurate estimator only in the tightening stage rather than
in both stages. This helps relax the computational burden, which
will be discussed in detail in Theorem 4.7. Our last result concerns the
oracle property of the obtained estimator $\widetilde{\beta}^{(\ell)}$ for $\ell$ large enough, with the
proof postponed to Appendix B in the online supplement. We first define
the oracle estimator $\widehat{\beta}^{\circ}$ as

$$ \widehat{\beta}^{\circ} = \operatorname{argmin}_{\operatorname{supp}(\beta) = S} \mathcal{L}(\beta). $$

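Theorem 4.4 (Strong Oracle Property). Suppose Assumptions 4.1 and 4.2
hold. Assume $\|\beta^*_S\|_{\min} \ge u + \gamma\lambda$ and $w(\gamma\lambda) = 0$ for some constant $\gamma$. Let
$4(\|\nabla \mathcal{L}(\widehat{\beta}^{\circ})\|_\infty + \varepsilon_c \vee \varepsilon_t) \le \lambda \lesssim r/\sqrt{s}$ and $\varepsilon_t \le \lambda/\sqrt{s}$. If $\|\widehat{\beta}^{\circ} - \beta^*\|_{\max} \le \eta_n \lesssim \lambda$,
then for $\ell$ large enough such that $\ell \gtrsim \log\{(1 + \varepsilon_c/\lambda)\sqrt{s}\}$, we have

$$ \widetilde{\beta}^{(\ell)} = \widehat{\beta}^{\circ}. $$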

The theorem above is again a deterministic result. A high-probability
bound can be obtained by bounding the probability of the event
$\{4(\|\nabla \mathcal{L}(\widehat{\beta}^{\circ})\|_\infty + (\varepsilon_c \vee \varepsilon_t)) \le \lambda\}$. The assumption that $\|\widehat{\beta}^{\circ} - \beta^*\|_{\max} \lesssim \lambda$ is very mild, because
the oracle estimator only depends on the intrinsic dimension $s$ rather than $d$.
For instance, under the linear model with sub-Gaussian errors, it can be shown
that $\|\widehat{\beta}^{\circ} - \beta^*\|_{\max} \lesssim \sqrt{\log s/n}$ with high probability.

Theorem 4.4 implies that the oracle estimator $\widehat{\beta}^{\circ}$ is a fixed point of the
I-LAMM algorithm, namely, once the initial estimator is $\widehat{\beta}^{\circ}$, the next iteration
produces the same estimator. This is in the same spirit as the result proved in
[12].

4.3. Computational Theory. In this section, we analyze the computational
rate for all of our approximate solutions. We start with the following
assumption.

Assumption 4.3. $\nabla \mathcal{L}(\beta)$ is locally $\rho_c$-Lipschitz continuous, i.e.

(4.2) $$ \|\nabla \mathcal{L}(\beta_1) - \nabla \mathcal{L}(\beta_2)\|_2 \le \rho_c \|\beta_1 - \beta_2\|_2, \quad \text{for } \beta_1, \beta_2 \in B_2(R/2, \beta^*), $$

where $\rho_c$ is the Lipschitz constant and $R \lesssim \|\beta^*\|_2 + \lambda\sqrt{s}$.

We then give the explicit iteration complexity of the contraction stage in 
the following proposition. Recall the definition of and -y u in Algorithm 
2.1, and p* in Assumption 4.1. 

Proposition 4.5 (Sublinear Rate in the Contraction Stage). Assume that
Assumptions 4.1 and 4.3 hold. Let $4(\|\nabla \mathcal{L}(\beta^*)\|_\infty + \varepsilon_c) \le \lambda \lesssim r/\sqrt{s}$. To achieve
an approximate local solution $\widetilde{\beta}^{(1)}$ such that $\omega_{\lambda^{(0)}}(\widetilde{\beta}^{(1)}) \le \varepsilon_c$ in the
contraction stage, we need no more than $((1 + \gamma_u)R\rho_c/\varepsilon_c)^2$ LAMM iterations,
where $\rho_c$ is a constant defined in (4.2).

The sublinear rate is due to the lack of strong convexity of the loss function
in the contraction stage, because we allow starting from an arbitrarily bad
initial value, say 0. Once the solution enters the contracting region (that is, the
tightening stage), the problem becomes sparse strongly convex (see Proposition B.3
in Appendix B), which endows the algorithm with a linear rate of convergence.
This is empirically demonstrated in Figure 2. Our next proposition gives a
formal statement on the geometric convergence rate for each subproblem in
the tightening stage.

Proposition 4.6 (Geometric Rate in the Tightening Stage). Suppose that
the same conditions for Theorem 4.2 hold. To obtain an approximate solution
$\widetilde{\beta}^{(\ell)}$ satisfying $\omega_{\lambda^{(\ell-1)}}(\widetilde{\beta}^{(\ell)}) \le \varepsilon_t$ in the $\ell$-th step of the tightening stage
($\ell \ge 2$), we need at most $C' \log(C'' \lambda\sqrt{s}/\varepsilon_t)$ LAMM iterations, where $C'$ and
$C''$ are two positive constants.

Proposition 4.6 suggests that we only need to conduct a logarithmic number
of LAMM iterations in each tightening step. Simply combining the
computational rates in the contraction and the tightening stages, we
obtain the global computational complexity.

Theorem 4.7. Assume that $\lambda\sqrt{s} = o(1)$. Suppose that the same conditions
for Theorem 4.2 hold. To achieve an approximate solution $\widetilde{\beta}^{(T)}$ such that
$\omega_{\lambda^{(0)}}(\widetilde{\beta}^{(1)}) \le \varepsilon_c \lesssim \lambda$ and $\omega_{\lambda^{(k-1)}}(\widetilde{\beta}^{(k)}) \le \varepsilon_t \lesssim \sqrt{1/n}$ for $2 \le k \le T$, the total
number of LAMM iterations we need is at most

$$ C' \frac{1}{\varepsilon_c^2} + C''(T - 1)\log\Big(\frac{1}{\varepsilon_t}\Big), $$

where $C'$ and $C''$ are two positive constants, and $T \asymp \log(\lambda\sqrt{n})$.

Remark 4.8. We complete this section with a remark on the sublinear rate
in the contraction stage. Without further structure, the sublinear rate in the
first stage is the best possible one for the proposed optimization procedure
when $\lambda$ is held fixed. A linear rate can be achieved when we start from a
sufficiently good initial value. Another strategy is to use the path-following
algorithm developed in [28], which gradually reduces the size
of $\lambda$ to ensure that the solution sequence is sparse.

5. Proof Strategy for Main Results. In this section, we present the 
proof strategies for the main statistical and computational theorems, with 
technical lemmas and other details left in the supplementary material. 

5.1. Proof Strategy for Statistical Recovery Result in Section 4.2.
Proposition 4.1 indicates that the contraction estimator suffers from a suboptimal
rate of convergence $\lambda\sqrt{s}$. The tightening stage helps refine the statistical rate
adaptively. To suppress the noise in the $\ell$-th subproblem, it is necessary to
control $\min_j \{ |\lambda_j^{(\ell-1)}| : j \in S^c \}$ in high dimensions. For this, we construct an
entropy set $\mathcal{E}_\ell$ of $S$ in each tightening subproblem to bound the magnitude
of $\|\lambda^{(\ell-1)}_{\mathcal{E}_\ell^c}\|_{\min}$. The entropy set at the $\ell$-th step is defined as

(5.1) $$ \mathcal{E}_\ell = S \cup \{ j : \lambda_j^{(\ell-1)} < \lambda w(u),\ u = 18\delta^{-1}\rho_*^{-1}\lambda \propto \lambda \}. $$

Under mild conditions, we will show that $|\mathcal{E}_\ell| \le 2s$ and $\|\lambda^{(\ell-1)}_{\mathcal{E}_\ell^c}\|_{\min} \ge \lambda w(u) \ge \lambda/2$,
which is more precisely stated in the following lemma.
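Lemma 5.1. Suppose that Assumptions 4.1 and 4.2 hold. If $4(\|\nabla \mathcal{L}(\beta^*)\|_\infty +
\varepsilon_t \vee \varepsilon_c) \le \lambda \lesssim r/\sqrt{s}$, we must have $|\mathcal{E}_\ell| \le 2s$, and the $\varepsilon$-optimal solution $\widetilde{\beta}^{(\ell)}$
satisfies

$$ \|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le 12\rho_*^{-1}\big( \|\lambda_S^{(\ell-1)}\|_2 + \|\nabla \mathcal{L}(\beta^*)_{\mathcal{E}_\ell}\|_2 + \varepsilon\sqrt{|\mathcal{E}_\ell|} \big) \le 18\rho_*^{-1}\lambda\sqrt{s} \lesssim \lambda\sqrt{s}. $$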

Lemma 5.1 bounds $\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2$ in terms of $\|\lambda_S^{(\ell-1)}\|_2$, which is further
upper bounded by the order of $\lambda\sqrt{s}$. The rate $\lambda\sqrt{s}$ coincides with the
convergence rate of the contraction estimator. Later, we will exploit this
result in our localized analysis to ensure that all the approximate solutions
$\{\widetilde{\beta}^{(\ell)}\}_{\ell=1}^{T}$ fall in a local $\ell_2$-ball centered at $\beta^*$ with radius $r \gtrsim \lambda\sqrt{s}$.

The next lemma further bounds $\|\lambda_S^{(\ell-1)}\|_2$ using functionals of $\widetilde{\beta}^{(\ell-1)}$,
which connects the adaptive regularization parameter to the estimator from
previous steps.

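Lemma 5.2. Assume $w \in \mathcal{T}$. Let $\lambda^{(\ell-1)} = \lambda w(|\widetilde{\beta}^{(\ell-1)}|)$; then for
any norm $\|\cdot\|_*$, we have

$$ \|\lambda_S^{(\ell-1)}\|_* \le \lambda\|w(|\beta^*_S| - u)\|_* + \lambda u^{-1}\|\beta^*_S - \widetilde{\beta}_S^{(\ell-1)}\|_*, $$

where $w(|\beta^*_S| - u) = (w(|\beta^*_j| - u))_{j \in S}$.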

Lemma 5.2 bounds the tightening weight $\lambda_S^{(\ell-1)}$ in the $\ell$-th subproblem
by two terms. The first term describes the coefficient effects: when the
coefficients are large enough (in absolute value) such that $\|\beta^*_S\|_{\min} \ge u + \gamma\lambda$
and $w(\gamma\lambda) = 0$, it becomes 0. The second term concerns the estimation error
of the estimator from the previous step. Combining the above two lemmas, we
prove that $\widetilde{\beta}^{(\ell)}$ benefits from the tightening stage and possesses a refined
statistical rate of convergence. The proof of Corollary 4.3 is left to Appendix
B in the online supplement.
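Proof of Theorem 4.2. Applying Lemma 5.1, we obtain that the size of
the entropy set $\mathcal{E}_\ell$ (see the definition in (5.1)) is bounded by $2s$ and

(5.2) $$ \|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C_1\big( \|\lambda_S^{(\ell-1)}\|_2 + \|\nabla \mathcal{L}(\beta^*)_{\mathcal{E}_\ell}\|_2 + \varepsilon_t\sqrt{|\mathcal{E}_\ell|} \big) \lesssim \lambda\sqrt{s}, $$

where $C_1 = 12\rho_*^{-1}$. Using Lemma 5.2 yields that

$$ \|\lambda_S^{(\ell-1)}\|_2 \le \lambda\|w(|\beta^*_S| - u)\|_2 + \lambda u^{-1}\|(\widetilde{\beta}^{(\ell-1)} - \beta^*)_S\|_2. $$

Plugging the inequality above into (5.2) gives

(5.3) $$ \|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C_1\big( \underbrace{\|\nabla \mathcal{L}(\beta^*)_{\mathcal{E}_\ell}\|_2 + \varepsilon_t\sqrt{|\mathcal{E}_\ell|}}_{\mathrm{I}} + \lambda\|w(|\beta^*_S| - u)\|_2 \big) + C_1\lambda u^{-1}\|(\widetilde{\beta}^{(\ell-1)} - \beta^*)_S\|_2. $$

We now simplify the inequality above by providing an upper bound for term
I. Decomposing the set $\mathcal{E}_\ell$ into $S$ and $\mathcal{E}_\ell \setminus S$ and applying the triangle
inequality along with the Hölder inequality, we have

(5.4) $$ \mathrm{I} \le \|\nabla \mathcal{L}(\beta^*)_S\|_2 + \varepsilon_t\sqrt{s} + (\|\nabla \mathcal{L}(\beta^*)\|_\infty + \varepsilon_t)\sqrt{|\mathcal{E}_\ell \setminus S|}. $$

Following the proof of Lemma 5.1 in Appendix B, $\sqrt{|\mathcal{E}_\ell \setminus S|}$ can be bounded
by

$$ \|\widetilde{\beta}^{(\ell-1)}_{\mathcal{E}_\ell \setminus S}\|_2 / u \le \|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2 / u, \quad \text{where } u = 18\rho_*^{-1}\delta^{-1}\lambda \propto \lambda. $$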

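Therefore, since $\|\nabla \mathcal{L}(\beta^*)\|_\infty + \varepsilon_t \le \lambda/4$, (5.4) can be simplified to

$$ \mathrm{I} \le \|\nabla \mathcal{L}(\beta^*)_S\|_2 + \varepsilon_t\sqrt{s} + \frac{\lambda}{4u}\|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2, $$

which, combined with (5.3), yields the contraction property with $\delta$.
Consequently, we obtain

$$ \|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C\big( \|\nabla \mathcal{L}(\beta^*)_S\|_2 + \varepsilon_t\sqrt{s} + \lambda\|w(|\beta^*_S| - u)\|_2 \big) + \delta^{\ell-1}\|\widetilde{\beta}^{(1)} - \beta^*\|_2 $$
$$ \le C\big( \|\nabla \mathcal{L}(\beta^*)_S\|_2 + \varepsilon_t\sqrt{s} + \lambda\|w(|\beta^*_S| - u)\|_2 \big) + C\delta^{\ell-1}\lambda\sqrt{s}, $$

where $C = C_1/(1 - \delta)$ and the last inequality follows from Proposition 4.1.
The proof is completed. □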
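The constant bookkeeping behind the $\delta$-contraction can be checked by hand. The following is a sketch rather than the authors' derivation: it assumes the bound $\sqrt{|\mathcal{E}_\ell \setminus S|} \le \|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2/u$ and uses only $C_1 = 12\rho_*^{-1}$, $u = 18\rho_*^{-1}\delta^{-1}\lambda$ and $4(\|\nabla \mathcal{L}(\beta^*)\|_\infty + \varepsilon_t) \le \lambda$; the shorthand $e_{\ell-1}$ is introduced here for illustration.

```latex
\begin{align*}
e_{\ell-1} &:= \|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2, \\
C_1\big(\|\nabla\mathcal{L}(\beta^*)\|_\infty + \varepsilon_t\big)\sqrt{|\mathcal{E}_\ell\setminus S|}
  &\le C_1 \cdot \frac{\lambda}{4} \cdot \frac{e_{\ell-1}}{u}, \\
C_1 \lambda u^{-1}\,\|(\widetilde{\beta}^{(\ell-1)} - \beta^*)_S\|_2
  &\le C_1 \cdot \frac{\lambda}{u} \cdot e_{\ell-1}, \\
\frac{5}{4}\cdot\frac{C_1\lambda}{u}
  &= \frac{5}{4}\cdot\frac{12\rho_*^{-1}\lambda}{18\rho_*^{-1}\delta^{-1}\lambda}
   = \frac{5}{6}\,\delta \;\le\; \delta .
\end{align*}
```

Under these assumptions the two terms involving the previous estimation error add up to at most $\delta\, e_{\ell-1}$, which is the contraction factor appearing in Theorem 4.2.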
5.2. Proof Strategy for Computational Result in Section 4.3. In this
section, we sketch the proofs of the results in Section 4.3. We
start with the contraction stage. The next lemma shows that the contraction
stage enjoys a sublinear rate of convergence. The proof can be found in
Appendix C.

Lemma 5.3. Recall that $F(\beta, \lambda) = \mathcal{L}(\beta) + \sum_j \lambda_j |\beta_j|$. We have

F(/3 (1 ’ fc) ,A (0) ) -F(/3 (1) , A (0) ) < |^|| / a (1 ' 0) 

The result above suggests that the optimization error decreases to zero
at the rate of $1/k$, while Proposition 4.1 indicates that the best statistical
rate for the contraction stage is only of the order of $\lambda\sqrt{s}$. Therefore, one
can stop the LAMM iterations in the contraction stage early, as soon as the iterate
enters the contraction region $\{\beta : \|\beta - \beta^*\|_2 \le C\lambda\sqrt{s},\ \beta \text{ is sparse}\}$. It is this
lemma that helps characterize the iteration complexity in terms of the total
number of LAMM updates needed in the contraction stage; see Proposition
4.5.
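The early-stopping idea can be made concrete with a small sketch of a LAMM-type loop for a weighted $\ell_1$-penalized least squares problem. Everything below is an assumption made for illustration: the quadratic loss, the soft-thresholding update, the restart rule for the quadratic coefficient `phi`, the default values `phi0 = 1e-3` and `gamma_u = 2.0`, and the stopping measure (the smallest sup-norm of the gradient plus a weighted subgradient of the $\ell_1$ norm). It is not the authors' implementation.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Componentwise soft-thresholding, the minimizer of the majorized problem."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def suboptimality(grad, beta, lam_vec):
    """Smallest sup-norm of grad + lam_vec * xi over subgradients xi of |beta|."""
    viol = np.where(beta != 0,
                    np.abs(grad + lam_vec * np.sign(beta)),
                    np.maximum(np.abs(grad) - lam_vec, 0.0))
    return viol.max()

def lamm_weighted_lasso(X, y, lam_vec, eps, phi0=1e-3, gamma_u=2.0, max_iter=5000):
    """Approximately minimize (2n)^{-1}||y - Xb||^2 + sum_j lam_j |b_j| from b = 0."""
    n, d = X.shape
    loss = lambda b: 0.5 * np.sum((y - X @ b) ** 2) / n
    grad = lambda b: -X.T @ (y - X @ b) / n
    beta, phi = np.zeros(d), phi0
    for k in range(1, max_iter + 1):
        g = grad(beta)
        phi = max(phi0, phi / gamma_u)   # restart from a smaller quadratic coefficient
        while True:
            new = soft_threshold(beta - g / phi, lam_vec / phi)
            step = new - beta
            # Accept once the isotropic quadratic majorizes the loss at the new point.
            if loss(new) <= loss(beta) + g @ step + 0.5 * phi * step @ step + 1e-12:
                break
            phi *= gamma_u               # otherwise inflate phi and try again
        beta = new
        if suboptimality(grad(beta), beta, lam_vec) <= eps:
            break                        # stop early at the requested precision
    return beta, k

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
beta_star = np.zeros(10)
beta_star[:2] = [3.0, -2.0]
y = X @ beta_star + 0.1 * rng.standard_normal(50)
lam_vec = np.full(10, 0.1)
beta_hat, iters = lamm_weighted_lasso(X, y, lam_vec, eps=1e-6)
print(iters, np.round(beta_hat, 3))
```

Calling the same routine with a crude `eps` mimics the contraction stage, while calling it with updated weights `lam_vec` and a smaller `eps` mimics one tightening step.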

To utilize the localized sparse eigenvalue condition in the tightening stage,
we need the following lemma, which characterizes the sparsity of all the
approximate solutions produced by the contraction stage.

Lemma 5.4. Assume that Assumption 4.1 holds. If $4(\|\nabla \mathcal{L}(\beta^*)\|_\infty + \varepsilon_c) \le
\lambda \lesssim r/\sqrt{s}$, then $\widetilde{\beta}^{(1)}$ in the contraction stage is $s + \tilde{s}$ sparse. In particular,
we have $\|(\widetilde{\beta}^{(1)})_{S^c}\|_0 \le \tilde{s}$.

Together with Proposition 4.1, it ensures that the contraction estimator
$\widetilde{\beta}^{(1)}$ falls in the contraction region $\{\beta : \|\beta - \beta^*\|_2 \le C\lambda\sqrt{s} \text{ and } \beta \text{ is sparse}\}$.
This makes the localized sparse eigenvalue condition useful and thus makes
the geometric rate of convergence possible.

Lemma 5.5 (Geometric Rate in the Tightening Stage). Under the same
conditions for Theorem 4.2, for any $\ell \ge 2$, $\{\beta^{(\ell,k)}\}_{k \ge 0}$ converges geometrically,

F(f3^ k \ -F(pW, 

< (l - ^|f(/3^°) , A^ -1 )) - F@W , A ( ^ 1} )}. 

The above result suggests that each subproblem in the tightening stage
enjoys a geometric rate of convergence, which is the fastest possible rate among
all first-order optimization methods under the blackbox model. Lemma 5.5
can be used to obtain the computational complexity analysis of each single
step of the tightening stage, i.e., Proposition 4.6.

6. Numerical Examples. In this section, we evaluate the statistical
performance of the proposed framework through several numerical
experiments. We consider the following three examples.

Example 6.1 (Linear Regression). In the first example, continuous responses
were generated according to the model

(6.1) $$ y_i = x_i^{T}\beta^* + \epsilon_i, \quad \text{where } \beta^* = (5, 3, 0, 0, -2, \underbrace{0, \ldots, 0}_{d-5})^{T}, $$

and $n = 100$. Moreover, in model (6.1), $\{x_i\}_{i \in [n]}$ are generated from the $N(0, \Sigma)$
distribution with covariance matrix $\Sigma$, independently of $\epsilon_i \sim N(0, 1)$.
We take $\Sigma$ as a correlation matrix $\Sigma = (\rho_{ij})$ as follows.

• Case 1: independent correlation design with $(\rho_{ij}) = \operatorname{diag}(1, \ldots, 1)$;

• Case 2: constant correlation design with $\rho_{ij} = 0.75$ if $i \ne j$; $\rho_{ij} = 1$
otherwise;

• Case 3: autoregressive correlation design with $\rho_{ij} = 0.95^{|i-j|}$.
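The three designs can be simulated with a few lines of code. The sketch below is one possible way to generate data from model (6.1); the function names, the default `d = 1000`, the seed handling and the use of a Cholesky factor to draw correlated covariates are assumptions made for illustration.

```python
import numpy as np

def correlation_matrix(d, case):
    """Correlation matrix Sigma = (rho_ij) for Cases 1-3."""
    if case == 1:
        return np.eye(d)
    if case == 2:
        return np.full((d, d), 0.75) + 0.25 * np.eye(d)
    if case == 3:
        idx = np.arange(d)
        return 0.95 ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError("case must be 1, 2 or 3")

def simulate_linear(n=100, d=1000, case=1, seed=0):
    """Draw (X, y) from y_i = x_i' beta* + eps_i with x_i ~ N(0, Sigma), eps_i ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    beta_star = np.zeros(d)
    beta_star[[0, 1, 4]] = [5.0, 3.0, -2.0]
    # Rows of Z @ chol.T have covariance chol @ chol.T = Sigma.
    chol = np.linalg.cholesky(correlation_matrix(d, case))
    X = rng.standard_normal((n, d)) @ chol.T
    y = X @ beta_star + rng.standard_normal(n)
    return X, y, beta_star

X, y, beta_star = simulate_linear(n=100, d=200, case=3, seed=0)
print(X.shape, y.shape, np.flatnonzero(beta_star))
```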

Example 6.2 (Logistic Regression). In the second example, independent
observations with binary responses are generated according to the model

$$ P(y_i = 1 \mid x_i) = \frac{\exp\{x_i^{T}\beta^*\}}{1 + \exp\{x_i^{T}\beta^*\}}, $$

where $\beta^*$ and $\{x_i\}_{i \in [n]}$ are generated in the same manner as in Case 1
of Example 6.1.

Example 6.3 (Varying Dimensions and Sample Sizes). In this example, we
continue Example 6.1 with varying dimensions and sample sizes. Specifically,
we consider linear regression under the autoregressive correlation design with
$\rho_{ij} = 0.90^{|i-j|}$, with $d$ varying from 1000 to 3500 and $n$ varying from 100 to
500.

In the first two examples, we fix the sample size $n$ at 100 and consider $d =
1000$. We investigate the sparsity recovery and estimation properties of the
I-LAMM (or TAC) estimator via numerical simulations. We compared the
I-LAMM estimator with the following methods: the oracle estimator, which
assumes the availability of the active set $S$; the refitted Lasso (Refit), which
uses a post least squares refit on the set selected by the Lasso; the adaptive
Lasso (ALasso) estimator with weight function $w(\beta_j) = 1/|\beta_j|$ proposed
by [33]; the smoothly clipped absolute deviation (SCAD) estimator [9] with
$a = 3.7$; and the minimax concave penalty (MCP) estimator with $a = 3$ [29].
For I-LAMM, we used 3-fold cross-validation to select the constant $c \in
0.5 \times \{1, 2, \ldots, 20\}$ in the tuning parameter $\lambda = c\sqrt{\log d/n}$ in the contraction
stage, with regularization parameters updated automatically at later steps.
We further took = 2, $\varepsilon_c = \sqrt{\log d/n}$ and $\varepsilon_t = \sqrt{1/n}$. For Lasso, we used
the I-LAMM algorithm; for ALasso, the sequential tuning in [7] was used: we
employed 3-fold cross-validation in each step, with the I-LAMM algorithm used;
and the SCAD and MCP estimators were computed using the R package
ncvreg, with 3-fold cross-validation used for tuning parameter selection.

Table 1
The median of MSE, TP, FP and Time (in seconds) under Case 1, Case 2 and Case 3 for
linear regression in Example 6.1 and for logistic regression in Example 6.2.

           Linear, Case 1                    Linear, Case 2
Method     MSE       TP     FP      Time     MSE       TP     FP      Time
I-LAMM     0.0285    3.00   0.00    0.17     0.0659    3.00   0.00    0.19
Lasso      0.3114    3.00   17.00   0.02     1.3709    3.00   16.00   0.04
Refit      0.5585    3.00   17.00   0.02     2.1573    3.00   16.00   0.04
ALasso     0.4616    3.00   15.00   0.06     1.6077    3.00   13.00   0.08
SCAD       0.0397    3.00   0.00    0.21     0.0695    3.00   0.00    0.23
MCP        0.0344    3.00   0.00    0.17     0.0706    3.00   0.00    0.22
Oracle     0.0258    3.00   0.00    -        0.0565    3.00   0.00    -

           Linear, Case 3                    Logistic
Method     MSE       TP     FP      Time     MSE       TP     FP      Time
I-LAMM     0.2819    3.00   3.00    0.22     8.94      3.00   0.00    0.20
Lasso      5.8061    2.00   20.00   0.03     26.92     3.00   20.00   0.03
Refit      2.6354    2.00   20.00   0.03     26.85     3.00   20.00   0.03
ALasso     4.4242    2.00   12.00   0.06     8.28      3.00   7.00    0.05
SCAD       14.8680   2.00   5.00    0.25     9.48      3.00   12.00   0.21
MCP        14.9381   1.00   1.00    0.18     11.84     3.00   3.00    0.22
Oracle     0.1661    3.00   0.00    -        3.32      3.00   0.00    -

For each simulation setting, we generated 100 simulated datasets and
applied the different estimators to each dataset. We report several statistics for
each estimator in Table 1. To measure the sparsity recovery performance,
we calculated the median of the number of zero coefficients incorrectly
estimated to be nonzero (i.e. false positives, denoted by FP) and the median of the
number of nonzero coefficients correctly estimated to be nonzero (i.e. true
positives, denoted by TP). To measure the estimation accuracy, we calculated
the median of the mean squared error (MSE). To evaluate the computational
efficiency, we report the median time (in seconds) used to produce the final
estimator for each method. Note that the computational times provided
here are for reference only; they depend on the optimization errors and the
implementation.
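For concreteness, the per-dataset statistics can be computed as in the sketch below. The function name is hypothetical, and the error measure is assumed here to be the squared $\ell_2$ estimation error, since the text does not spell out the normalization of the MSE.

```python
import numpy as np

def selection_metrics(beta_hat, beta_star):
    """Return (TP, FP, squared l2 error) for one fitted coefficient vector."""
    est, true = beta_hat != 0, beta_star != 0
    tp = int(np.sum(est & true))     # nonzero coefficients estimated as nonzero
    fp = int(np.sum(est & ~true))    # zero coefficients estimated as nonzero
    sq_err = float(np.sum((beta_hat - beta_star) ** 2))
    return tp, fp, sq_err

beta_star = np.array([5.0, 3.0, 0.0, 0.0, -2.0, 0.0])
beta_hat = np.array([4.9, 3.1, 0.2, 0.0, 0.0, 0.0])
print(selection_metrics(beta_hat, beta_star))
```

The entries reported for one method would then be the medians of these three quantities over the 100 replications, for example via `np.median`.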

We have several important observations. First, it is not surprising that
Lasso tends to overfit. Other procedures improve the performance of Lasso
by reducing the estimation bias and the false positive rate. The best overall
performance is achieved by the I-LAMM estimator, with small MSE and
FP in all cases. The MCP and SCAD estimators also perform well overall
in the logistic regression model and in Case 1 and Case 2 of the linear
regression model. However, MCP, SCAD and ALasso all break down by
missing true positives in Case 3, where the design matrix exhibits a strong
correlation between features, while I-LAMM remains the best, followed by
the Lasso estimator. This suggests the superiority of I-LAMM over other
nonconvex penalized regression methods under strongly correlated designs.
In Example 6.3 (Figure 4), the MSE of the I-LAMM estimator stays flat when the dimension $d$ varies,
which justifies the oracle rate $\sqrt{s/n}$. SCAD and MCP have competitive
performance when the dimension is relatively small, but they quickly break
down when the dimension gets larger. This is possibly due to the numerical
instability of directly solving nonconvex systems. This phenomenon is also
observed in [28]. As the sample size increases, the performances of
I-LAMM, SCAD and MCP become almost identical to each other, while the other
convex methods suffer from slightly worse performance.

Fig 4. The median of MSE with varying dimensions and sample sizes in Example 6.3.

In addition, to demonstrate the phase transition phenomenon, in Figure
2, we plot the log estimation error versus the number of iterations for each
tightening step for Case 2 in Example 6.1. Indeed, the contraction stage
suffers from a sublinear rate of convergence before getting into the contracting
region and enjoys a geometric rate afterwards, while the tightening stage
has a geometric rate of convergence. These are in line with our asymptotic
theory.

7. Conclusions and Discussions. We propose a computational framework,
I-LAMM (or TAC), for simultaneous control of algorithmic complexity
and statistical error when fitting high dimensional models. Even though
I-LAMM only solves a sequence of convex programs approximately, the solution
enjoys the same optimal statistical property as the unobtainable global
optimum for the folded-concave penalized regression. Our theoretical treatment
relies on a novel localized analysis which avoids the parameter bound
constraint, such as $\|\beta\|_1 \le R$, used in all other recent works. Statistically, a
$\delta$-contraction property is established: each convex program contracts the
previous estimator by a $\delta$-fraction until the optimal statistical error is reached.
Computationally, a phase transition in algorithmic convergence is established.
The contraction stage enjoys only a sublinear rate of convergence
while the tightening stage converges geometrically fast.

Recently, [22] proposed the restricted eigenvalue condition for unified
M-estimators. [18] leveraged this condition, and their work is more closely related to our
localized conditions. However, there are two major differences. First, their local
parameter $r$ is fixed at a constant independent of $n, d, s$, while we allow it to
go to 0 as long as $r \gtrsim \sqrt{s\log d/n}$. Second, their high dimensional regression
problem relies on the $\ell_1$ ball constraint $\|\beta\|_1 \le R$, while our newly developed
localized analysis, together with the localized conditions, removes this type
of constraint. In [21], the authors only consider the solutions in a local cone,
which makes their analysis much simpler than ours. In this paper, we
provide a stronger result: with high probability, all local solutions must fall in a
local sparse (or $\ell_1$) cone, which makes the localized eigenvalue conditions
applicable.

More recently, [27] proposed a two-step approach named calibrated CCCP
which achieves strong oracle properties when using the Lasso estimator as
initialization. Our work differs from theirs in two aspects. First, their work aims
at analyzing the least squares loss while our analysis handles much broader
families of loss functions. Second, their procedure attains an oracle rate but
requires the minimum signal strength to be of the order of $s\sqrt{\log d/n}$. Such
a requirement is suboptimal. In contrast, our results require only $\sqrt{\log d/n}$.
This weakened assumption on minimum signal strength also distinguishes
I-LAMM from other convex procedures, such as least squares refit after model
selection [3]. In [27], the authors also proposed a high dimensional BIC
criterion for variable selection and finding the oracle estimator along the solution
path. We believe such a criterion can also be applied to our framework under
general conditions. In further studies, [20], [19] and [28] study the
theoretical properties of nonconvex penalized M-estimators. Specifically, [20] and
[19] provide conditions under which all the local optima obtained by an
$\ell_1$-ball constrained optimization enjoy desired statistical rates. [28] propose
a path-following strategy to obtain optimal computational and statistical
rates of convergence, which also relies on an extra ball constraint.

Our work differs from the aforementioned literature in at least three
aspects:

(1) Our theory exploits a new notion of localized analysis, which is not
available in [20], [19] and [28]. Such analysis allows us to eliminate the extra
ball constraints in previous work, which introduce more tuning effort
and are intuitively redundant given the penalty function.

(2) Our statistical results tolerate explicit computational precisions and
are valid for all obtained approximate solutions, while the analysis in
[20] only targets the exact local solutions. Moreover, our
computational result does not rely on the path-following type strategy as in
[28] and is valid for any algorithm with desired statistical properties
as basic building blocks within each of the tightening steps.

(3) We provide a refined oracle statistical rate $\sqrt{s/n}$ for the obtained
approximate solution, while [20] and [28] do not provide such a result.
[20] provide a statistical rate which is also achievable using the convex
Lasso penalty. [28] only prove the oracle rate for exact local solutions.

Our work can be applied to many different topics: low rank matrix
completion problems, high dimensional graphical models, quantile regression
and many others. We conjecture that in all of the aforementioned topics,
I-LAMM can give a faster rate by approximately solving a sequence of
convex programs, with controlled computing resources. It is also interesting
to see how our algorithm works in large-scale distributed systems. Are there
any fundamental tradeoffs between statistical efficiency, communication and
time complexity? We leave these as future research projects.

Supplementary Material. The supplementary material contains proofs
for Corollary 4.3, Theorem 4.4, Proposition 4.5, Proposition 4.6 and
Theorem 4.7 in Section 4. It collects proofs of the lemmas presented in Section
5. An application to robust linear regression is given in Appendix D. Other
technical lemmas are collected in Appendices E and F.
Supplementary Material to
“I-LAMM: Simultaneous Control of
Algorithmic Complexity and Statistical Error”

By Jianqing Fan, Han Liu, Qiang Sun and Tong Zhang

The supplementary material contains proofs for Corollary 4.3, Theorem 
4.4, Proposition 4.5, Proposition 4.6 and Theorem 4.7 in Section 4. It collects 
the proofs of the key lemmas presented in Section 5. An application to 
robust linear regression is given in Appendix D. Other technical lemmas are 
collected in Appendices E and F. 


APPENDIX A: GENERAL CONVEX LOSS FUNCTIONS 

We require the convex loss $\mathcal{L}$ to have a continuous first-order derivative. In
addition, we require it to be locally twice differentiable almost everywhere.
Specifically, we consider the following family of loss functions

$$ \mathcal{F}_{\mathcal{L}} = \{ \mathcal{L} : \mathcal{L} \text{ is convex},\ \nabla \mathcal{L} \text{ is continuous and differentiable in } B_2(r, \beta^*) \}, $$

where $\beta^*$ is any vector in $\mathbb{R}^d$ and $r \gtrsim \lambda\sqrt{s}$. This family includes many
interesting loss functions. Some examples are given below:

• Logistic Loss: Let $\{(x_i, y_i)\}_{i \in [n]}$ be $n$ observed data points of $(X, y)$,
where $X$ is a $d$-dimensional covariate and $y_i \in \{-1, +1\}$. The logistic loss
is given in the form

$$ \mathcal{L}(\beta) = n^{-1} \sum_{i=1}^{n} \{ \log(1 + \exp(-y_i x_i^{T}\beta)) \}, $$

where $n$ is the sample size.

• Huber Loss: In robust linear regression, the Huber loss takes the
form

$$ \mathcal{L}(\beta) = n^{-1} \sum_{i=1}^{n} \{ \ell_\alpha(y_i - x_i^{T}\beta) \}, $$

where $\ell_\alpha(x) = 2\alpha^{-1}|x| - \alpha^{-2}$ when $|x| > \alpha^{-1}$ and $\ell_\alpha(x) = x^2$ otherwise.

• Gaussian Graphical Model Loss: Let $\Theta = \Sigma^{-1}$ be the precision
matrix and $\widehat{\Sigma}$ be the sample covariance matrix. The negative log-likelihood
loss of the Gaussian graphical model is

$$ \mathcal{L}(\Theta) = \operatorname{tr}(\widehat{\Sigma}\Theta) - \log\det(\Theta). $$
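As a quick check of the Huber loss above, the piecewise definition can be coded directly and evaluated around the kink $|x| = \alpha^{-1}$. This is a minimal sketch; the function names `huber` and `huber_loss` are hypothetical.

```python
import numpy as np

def huber(x, alpha):
    """l_alpha(x) = 2|x|/alpha - alpha^{-2} if |x| > 1/alpha, and x^2 otherwise."""
    x = np.asarray(x, dtype=float)
    linear_part = 2.0 * np.abs(x) / alpha - alpha ** -2.0
    return np.where(np.abs(x) > 1.0 / alpha, linear_part, x ** 2)

def huber_loss(beta, X, y, alpha):
    """Empirical Huber loss n^{-1} sum_i l_alpha(y_i - x_i' beta)."""
    return float(np.mean(huber(y - X @ beta, alpha)))

# With alpha = 2 the kink sits at |x| = 0.5, where both branches equal 0.25.
print(huber([0.1, 0.5, 1.0], alpha=2.0))
```

Both branches and their first derivatives agree at the kink, which is the continuity of the first-order derivative required of this family of losses.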
We note that the Huber loss is convex and has a continuous first-order
derivative. In addition, its second derivative exists in a neighborhood of $\beta^*$.
Therefore it belongs to the family of loss functions defined above. Other loss
functions include the least squares loss and locally twice differentiable convex
composite likelihood losses.

We remark here that the loss functions analyzed in our paper can be
nonconvex. If $\mathcal{L}(\beta)$ is nonconvex, we can decompose it as $\mathcal{L}(\beta) = \widetilde{\mathcal{L}}(\beta) +
\mathcal{H}(\beta)$ such that $\widetilde{\mathcal{L}}(\beta)$ is the convex part and $\mathcal{H}(\beta)$ is the concave part. We
then write the objective function as $F(\beta) = \widetilde{\mathcal{L}}(\beta) + \widetilde{\mathcal{R}}(\beta)$ such that $\widetilde{\mathcal{R}}(\beta) =
\mathcal{H}(\beta) + \mathcal{R}(\beta)$ and treat $\widetilde{\mathcal{R}}(\beta)$ as our new regularizer. If the corresponding
weight function $w(\cdot)$ satisfies Assumption 4.2, our theory goes through
without any problems. A similar technique is exploited in [28].

APPENDIX B: PROOFS OF STATISTICAL THEORY 

B.1. Statistical Theory under LSE Condition. In this section, we
collect the proofs for Corollary 4.3 and Theorem 4.4. We give proofs for the
key technical lemmas in Section 5.1, which are used to prove Theorem 4.2.
Other technical lemmas are postponed to later sections. We then establish
parallel results under the localized restricted eigenvalue condition.

B.1.1. Proofs of Main Results. In this section, we first prove Corollary
4.3 and then give the proof of Theorem 4.4.

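Proof of Corollary 4.3. We start by bounding $\mathbb{P}(\|\nabla\mathcal{L}(\beta^*)\|_\infty \ge \lambda/8)$, where $\nabla\mathcal{L}(\beta^*) = n^{-1}X^{T}(y - X\beta^*)$. For $\lambda \ge c\sqrt{\log d/n}$, using the union bound, we obtain

$$\mathbb{P}\big(\|\nabla\mathcal{L}(\beta^*)\|_\infty \ge \lambda/8\big) \le \mathbb{P}\big(n^{-1}\|X^{T}(y - X\beta^*)\|_\infty \ge 8^{-1}c\sqrt{\log d/n}\big) \le \sum_{j=1}^{d}\mathbb{P}\big(n^{-1}|X_{*j}^{T}\epsilon| \ge 8^{-1}c\sqrt{\log d/n}\big). \tag{B.1}$$

Let $v_j = n^{-1}X_{*j}^{T}\epsilon$. Since $\epsilon_i$ is sub-Gaussian$(0, \sigma^2)$ for $i = 1, \ldots, n$, we obtain

$$\mathbb{E}\big(\exp\{t_0 v_j\} + \exp\{-t_0 v_j\}\big) \le 2\exp\big\{n^{-2}\|X_{*j}\|_2^2\sigma^2 t_0^2/2\big\},$$

which implies $\mathbb{P}(|v_j| \ge t)\exp\{t_0 t\} \le 2\exp\{n^{-2}\|X_{*j}\|_2^2\sigma^2 t_0^2/2\}$. Taking $t_0 = t(n^{-2}\|X_{*j}\|_2^2\sigma^2)^{-1}$ yields that

$$\mathbb{P}(|v_j| \ge t) \le 2\exp\Big\{-\frac{t^2}{2\sigma^2\|X_{*j}\|_2^2/n^2}\Big\}.$$

Further taking $t = \lambda/8$ in the bound above and plugging it into (B.1) bounds the probability that $\|\nabla\mathcal{L}(\beta^*)\|_\infty \ge \lambda/8$.

Define the event $\mathcal{J}_1 = \{\|\nabla\mathcal{L}(\beta^*)\|_\infty \le \lambda/8\}$. Then with probability at least $1 - 2d^{-\eta_1}$, where $\eta_1 = c^2/(128\sigma^2)$, we have $\|\nabla\mathcal{L}(\beta^*)\|_\infty \le \lambda/8$.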

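It remains to show the oracle rate holds. Applying Theorem 4.2, we have

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C\Big(\|\nabla\mathcal{L}(\beta^*)_S\|_2 + \underbrace{\epsilon_t\sqrt{s}}_{\mathrm{I}} + \underbrace{\lambda\|w(|\beta_S^*| - u)\|_2}_{\mathrm{II}}\Big) + \underbrace{C\delta^{\ell-1}\lambda\sqrt{s}}_{\mathrm{III}}. \tag{B.2}$$

For I, $\epsilon_t\sqrt{s} \le \sqrt{s/n}$ since $\epsilon_t \le \sqrt{1/n}$. Because $\|\beta_S^*\|_{\min} \ge u + \gamma\lambda$, we have

$$w(|\beta_S^*| - u) \le w(\|\beta_S^*\|_{\min}\mathbf{1}_S - u) \le (w(\gamma\lambda), \ldots, w(\gamma\lambda))^{T} = 0.$$

This implies II $= 0$. For III, because $\ell \ge \lfloor\log(\lambda\sqrt{n})/\log(1/\delta)\rfloor + 2 > \log(\lambda\sqrt{n})/\log(1/\delta) + 1$, we obtain

$$\mathrm{III} = C\delta^{\ell-1}\lambda\sqrt{s} \le C\delta^{\log(\lambda\sqrt{n})/\log(1/\delta)}\lambda\sqrt{s} = C\sqrt{s/n}.$$

Plugging the bounds of I, II and III back into (B.2), we find that $\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2$ is bounded by $C\|\nabla\mathcal{L}(\beta^*)_S\|_2$ plus a term of order $\sqrt{s/n}$.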

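It remains to bound $\|\nabla\mathcal{L}(\beta^*)_S\|_2$. For the quadratic loss,

$$\nabla\mathcal{L}(\beta^*)_S = n^{-1}X_{*S}^{T}(y - X\beta^*) = n^{-1}X_{*S}^{T}\epsilon.$$

Taking $v = \epsilon$, $A = n^{-1}X_{*S}X_{*S}^{T}$ and $t = \mathbb{E}\epsilon^{T}A\epsilon$ in the Hanson-Wright inequality (Lemma F.3) yields that

$$\mathbb{P}\big(|\epsilon^{T}A\epsilon - \mathbb{E}\epsilon^{T}A\epsilon| > \mathbb{E}\epsilon^{T}A\epsilon\big) \le 2\exp\Big\{-C_h\min\Big(\frac{\mathbb{E}\epsilon^{T}A\epsilon}{\sigma^2\|A\|_2}, \frac{(\mathbb{E}\epsilon^{T}A\epsilon)^2}{\sigma^4\|A\|_F^2}\Big)\Big\} \le 2\exp\Big\{-C_h\min\Big(\frac{s\sigma^2}{\sigma^2\lambda_{\max}(A)}, \frac{s^2\sigma^4}{s\sigma^4\lambda_{\max}^2(A)}\Big)\Big\}, \tag{B.3}$$

where $C_h$ is a universal constant that does not depend on $n, d, s$; and $\mathbb{E}\epsilon^{T}A\epsilon = s\sigma^2$ using the expectation of a quadratic form. Note that the non-zero singular values of $X_{*S}^{T}X_{*S}$ and $X_{*S}X_{*S}^{T}$ are the same and that $\rho_+(s, r)$ is bounded above by $\rho^*$. We therefore have

$$\|A\|_2 = \Big\|\frac{1}{n}X_{*S}X_{*S}^{T}\Big\|_2 = \Big\|\frac{1}{n}X_{*S}^{T}X_{*S}\Big\|_2 \le \rho_+(s, r) \le \rho^*,$$

which, together with (B.3), gives

$$\mathbb{P}\big(|\epsilon^{T}A\epsilon - \mathbb{E}\epsilon^{T}A\epsilon| > \mathbb{E}\epsilon^{T}A\epsilon\big) \le 2\exp\{-C_h' s\},$$

where $C_h' = C_h\min\{1/\rho^*, 1/(\rho^*)^2\}$.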

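Define $P_S$ to be the projection matrix onto the column space of $X_{*S}$. Since $P_S X_{*S} = X_{*S}$, we obtain that $\epsilon^{T}A\epsilon = (P_S\epsilon)^{T}A(P_S\epsilon)$. Thus we have $\mathbb{E}\epsilon^{T}A\epsilon \le \lambda_{\max}(A)\,\mathbb{E}\|P_S\epsilon\|_2^2$. Further define the event $\mathcal{J}_2 = \{|\epsilon^{T}A\epsilon - \mathbb{E}\epsilon^{T}A\epsilon| \le \mathbb{E}\epsilon^{T}A\epsilon\}$. Then $\mathbb{P}(\mathcal{J}_2) \ge 1 - 2\exp\{-C_h' s\}$, and on $\mathcal{J}_2$,

$$\|\nabla\mathcal{L}(\beta^*)_S\|_2 = \sqrt{n^{-1}\epsilon^{T}A\epsilon} \le \sqrt{2n^{-1}\mathbb{E}\epsilon^{T}A\epsilon} \le \sqrt{2\rho^*/n}\,\sqrt{\mathbb{E}\|P_S\epsilon\|_2^2} \le \sqrt{2\rho^*}\,\sigma\sqrt{s/n}.$$

Define $\mathcal{J} = \mathcal{J}_1 \cap \mathcal{J}_2$. Then, on the event $\mathcal{J}$, we have

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C(\sqrt{2\rho^*}\,\sigma + 1)\sqrt{s/n} \propto \sqrt{s/n},$$

where $\mathbb{P}(\mathcal{J}) \ge 1 - \mathbb{P}(\mathcal{J}_1^c) - \mathbb{P}(\mathcal{J}_2^c)$. In other words, the above bound holds with probability at least $1 - 2d^{-\eta_1} - 2\exp\{-\eta_2 s\}$, in which $\eta_1 = c^2/(128\sigma^2)$ and $\eta_2 = C_h'$. □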

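We then give the proof for the oracle property under the LSE condition. A similar result holds under the LRE condition.

Proof of Theorem 4.4. Let us define $S^{(\ell)} = \{j : |\widetilde{\beta}_j^{(\ell)} - \beta_j^*| \ge u\}$, where $u$ is defined in Assumption 4.2. We have $S^{(0)} = \{j : |\beta_j^*| \ge u\} = S$. We need several lemmas. Our first lemma bounds the discrepancy between $\widetilde{\beta}^{(\ell)}$ and $\widehat{\beta}^{\circ}$. The proof is similar to that of Lemma B.7.

Lemma B.1. Suppose Assumptions 4.1 and 4.2 hold. Let $C = 12/\rho_*$. If $4(\|\nabla\mathcal{L}(\widehat{\beta}^{\circ})\|_\infty + \epsilon_c \vee \epsilon_t) \le \lambda \le r/\sqrt{s}$, we must have $|\mathcal{E}_\ell| \le 2s$, and for $\ell \ge 2$, the $\epsilon$-optimal solution $\widetilde{\beta}^{(\ell)}$ must satisfy

$$\|\widetilde{\beta}^{(\ell)} - \widehat{\beta}^{\circ}\|_2 \le C\big(\|\lambda_{\mathcal{E}_\ell}^{(\ell-1)}\|_2 + \epsilon_t\sqrt{|\mathcal{E}_\ell|}\big).$$

Our second lemma connects $\lambda^{(\ell-1)}$ to $\widetilde{\beta}^{(\ell-1)}$. The proof follows a similar argument used in the proof of Lemma 5.2 and thus is omitted.

Lemma B.2. We have

$$\|\lambda_{\mathcal{E}_\ell}^{(\ell-1)}\|_2 \le \lambda\|w(|\beta_S^*| - u)\|_2 + \lambda\big|\{j \in S : |\widetilde{\beta}_j^{(\ell-1)} - \beta_j^*| \ge u\}\big|^{1/2} + \lambda\sqrt{|\mathcal{E}_\ell \setminus S|}.$$

Combining the two lemmas above and using the definition of $S^{(\ell)}$, we obtain

$$\|\widetilde{\beta}^{(\ell)} - \widehat{\beta}^{\circ}\|_2 \le C\Big\{\underbrace{\lambda\|w(|\beta_S^*| - u)\|_2}_{\mathrm{I}} + \lambda\sqrt{|S^{(\ell-1)} \cap S|} + \underbrace{\lambda\sqrt{|\mathcal{E}_\ell \setminus S|}}_{\mathrm{II}} + \epsilon_t\sqrt{|\mathcal{E}_\ell|}\Big\}.$$

Since $\|\beta_S^*\|_{\min} \ge u + \alpha\lambda$ and $w(\alpha\lambda) = 0$, we have I $= \lambda\|w(|\beta_S^*| - u)\|_2 = 0$. For any $j \in \mathcal{E}_\ell \setminus S$, we must have $\lambda_j^{(\ell-1)} < \lambda w(u)$, and thus $|\widetilde{\beta}_j^{(\ell-1)}| = |\widetilde{\beta}_j^{(\ell-1)} - \beta_j^*| \ge u$ since $\beta_j^* = 0$ for $j \in S^c$. This implies $\mathcal{E}_\ell \setminus S \subseteq S^{(\ell-1)} \setminus S$, or equivalently II $\le \lambda\sqrt{|S^{(\ell-1)} \setminus S|}$. Therefore, for $\ell \ge 2$, we have

$$\|\widetilde{\beta}^{(\ell)} - \widehat{\beta}^{\circ}\|_2 \le C\Big\{\lambda\sqrt{|S^{(\ell-1)} \cap S|} + \lambda\sqrt{|S^{(\ell-1)} \setminus S|} + \epsilon_t\sqrt{|\mathcal{E}_\ell|}\Big\} \le C\Big(\lambda\sqrt{2|S^{(\ell-1)}|} + \epsilon_t\sqrt{|\mathcal{E}_\ell|}\Big). \tag{B.4}$$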

On the other hand, since $\|\widehat{\beta}^{\circ} - \beta^*\|_{\max} \le \eta_n \le 8^{-1}\rho_*^{-1}\lambda$, $j \in S^{(\ell)}$ implies that

\&P -0j\> \pf -/3* H/3° - /?* | > u - A > 12 V2r V" 1 A. 

We then bound $\sqrt{|S^{(\ell)}|}$ in terms of $\|\widetilde{\beta}^{(\ell)} - \widehat{\beta}^{\circ}\|_2$:

J\sw\ < } < d'71^- 1 )! + 8 

v U — T] n V 1 1 A 

Doing induction on $\ell$ and using the fact that $S^{(0)} = S$, we obtain

J\sm\<s‘Vi + 6' e -?£+ ‘, e, f 

v A 1 — 0 A 

Thus, for $\ell$ large enough such that $\ell \ge \log\{(1 + \epsilon_c/\lambda)\sqrt{s}\}$ and $\epsilon_t$ small enough such that $\epsilon_t \le \lambda/\sqrt{s}$, the right hand side of the above inequality is smaller than 1, which implies that

$$S^{(\ell)} = \emptyset \quad\text{and thus}\quad \widetilde{\beta}^{(\ell)} = \widehat{\beta}^{\circ}.$$

Therefore, the estimator enjoys the strong oracle property.

□

B.2. Key Lemmas. In this section, we collect proofs for Lemma 5.1 and Lemma 5.2. We start with a proposition that connects the LSE condition to the localized versions of the sparse strong convexity/sparse strong smoothness (SSC/SSM) in [1], which will be frequently used in our theoretical analysis. Let

$$D_{\mathcal{L}}(\beta_1, \beta_2) = \mathcal{L}(\beta_1) - \mathcal{L}(\beta_2) - \langle\nabla\mathcal{L}(\beta_2), \beta_1 - \beta_2\rangle \quad\text{and}\quad D_{\mathcal{L}}^{s}(\beta_1, \beta_2) = D_{\mathcal{L}}(\beta_1, \beta_2) + D_{\mathcal{L}}(\beta_2, \beta_1).$$

Proposition B.3. For any $\beta_1, \beta_2 \in B_2(r, \beta^*) = \{\beta : \|\beta - \beta^*\|_2 \le r\}$ such that $\|\beta_1 - \beta_2\|_0 \le m$, we have

$$\tfrac{1}{2}\rho_-(m, r)\|\beta_1 - \beta_2\|_2^2 \le D_{\mathcal{L}}(\beta_1, \beta_2) \le \tfrac{1}{2}\rho_+(m, r)\|\beta_1 - \beta_2\|_2^2,$$
$$\rho_-(m, r)\|\beta_1 - \beta_2\|_2^2 \le D_{\mathcal{L}}^{s}(\beta_1, \beta_2) \le \rho_+(m, r)\|\beta_1 - \beta_2\|_2^2.$$

Proof of Proposition B.3. We prove the second inequality. By the mean value theorem, there exists a $\gamma \in [0, 1]$ such that $\widetilde{\beta} = \gamma\beta_1 + (1 - \gamma)\beta_2 \in B_2(r, \beta^*)$, $\|\beta_1 - \beta_2\|_0 \le m$ and

$$\langle\nabla\mathcal{L}(\beta_1) - \nabla\mathcal{L}(\beta_2), \beta_1 - \beta_2\rangle = (\beta_1 - \beta_2)^{T}\{\nabla^2\mathcal{L}(\widetilde{\beta})\}(\beta_1 - \beta_2).$$

By the definition of the localized sparse eigenvalue, we obtain the desired result. The other inequality can be proved similarly. □
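The two divergences in Proposition B.3 are easy to evaluate numerically. The sketch below, in plain Python, uses a hypothetical quadratic loss $\mathcal{L}(\beta) = \tfrac12\beta^{T}H\beta - g^{T}\beta$ (an illustrative choice, not a loss from the paper), for which $D_{\mathcal{L}}(\beta_1, \beta_2) = \tfrac12(\beta_1 - \beta_2)^{T}H(\beta_1 - \beta_2)$ and the symmetric version is exactly twice that. All function names are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quad_loss(beta, H, g):
    """L(beta) = 0.5 * beta^T H beta - g^T beta, with H symmetric."""
    Hb = [dot(row, beta) for row in H]
    return 0.5 * dot(beta, Hb) - dot(g, beta)


def quad_grad(beta, H, g):
    """Gradient H beta - g of the quadratic loss."""
    return [dot(row, beta) - gj for row, gj in zip(H, g)]


def bregman(loss, grad, b1, b2):
    """D_L(b1, b2) = L(b1) - L(b2) - <grad L(b2), b1 - b2>."""
    diff = [x - y for x, y in zip(b1, b2)]
    return loss(b1) - loss(b2) - dot(grad(b2), diff)


def sym_bregman(loss, grad, b1, b2):
    """D^s_L(b1, b2) = D_L(b1, b2) + D_L(b2, b1)."""
    return bregman(loss, grad, b1, b2) + bregman(loss, grad, b2, b1)
```

For a quadratic loss the Hessian is constant, so the sandwich bounds of the proposition reduce to the extreme eigenvalues of $H$ restricted to the support of $\beta_1 - \beta_2$.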

We then present the proof for Lemma 5.1 below. 

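Proof of Lemma 5.1. Suppose that, for all $\ell \ge 1$, the following two inequalities hold:

$$|\mathcal{E}_\ell| = k \le 2s, \text{ where } \mathcal{E}_\ell \text{ is defined in (5.1), and} \tag{B.5}$$

$$\|\lambda_{\mathcal{E}_\ell^c}^{(\ell-1)}\|_{\min} \ge \lambda/2 \ge \|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon. \tag{B.6}$$

Applying Lemma E.2 in the online supplement, we obtain the desired bound:

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le \frac{12}{\rho_*}\Big(\|\lambda_S^{(\ell-1)}\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}_\ell}\|_2 + \epsilon\sqrt{|\mathcal{E}_\ell|}\Big) \le \frac{18}{\rho_*}\lambda\sqrt{s} \le r.$$

Therefore, it remains to show (B.5) and (B.6) hold for all $\ell \ge 1$. We prove these by induction. For $\ell = 1$, $\lambda \ge \lambda w(u)$ and thus $\mathcal{E}_1 = S$, which implies (B.5) and (B.6). Assume these two statements hold at $\ell - 1$. Since $j \in \mathcal{E}_\ell \setminus S$ implies $j \notin S$ and $\lambda w(|\widetilde{\beta}_j^{(\ell-1)}|) = \lambda_j^{(\ell-1)} < \lambda w(u)$ by definition, and since $w(x)$ is non-increasing, we obtain that $|\widetilde{\beta}_j^{(\ell-1)}| \ge u$. Therefore by the induction hypothesis, we have

$$\sqrt{|\mathcal{E}_\ell \setminus S|} \le \frac{\|\widetilde{\beta}_{\mathcal{E}_\ell \setminus S}^{(\ell-1)}\|_2}{u} \le \frac{\|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2}{u} \le \frac{18\lambda}{\rho_* u}\sqrt{s} \le \sqrt{s},$$

where the last inequality uses the definition of $u$ in (5.1). The inequality above implies that $|\mathcal{E}_\ell| \le 2s$. For such $\mathcal{E}_\ell^c$, we have $\|\lambda_{\mathcal{E}_\ell^c}^{(\ell-1)}\|_{\min} \ge \lambda w(u) \ge \lambda/2 \ge \|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon$, which completes the induction step. This completes the proof. □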


Proof of Lemma 5.2. If $|\beta_j^* - \beta_j| \ge u$, then $w(|\beta_j|) \le 1 \le u^{-1}|\beta_j - \beta_j^*|$; otherwise, $w(|\beta_j|) \le w(|\beta_j^*| - u)$. Therefore, the following inequality always holds:

$$w(|\beta_j|) \le w(|\beta_j^*| - u) + u^{-1}|\beta_j^* - \beta_j|.$$

Applying the triangle inequality completes the proof. □
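The coordinatewise inequality in the proof of Lemma 5.2 only uses that $w$ is non-increasing with values in $[0, 1]$, so it can be spot-checked numerically. The sketch below uses a normalized SCAD-type weight as a stand-in; this particular `scad_weight`, its parameter `a = 3.7`, and the sampling ranges are illustrative assumptions rather than the paper's exact weight function.

```python
import random


def scad_weight(x, a=3.7):
    """Normalized SCAD-type weight: 1 for x <= 1, linear decay on (1, a], 0 beyond."""
    if x <= 1.0:
        return 1.0
    if x <= a:
        return (a - x) / (a - 1.0)
    return 0.0


def lemma_5_2_gap(beta_j, beta_star_j, u, w=scad_weight):
    """Right-hand side minus left-hand side of the displayed inequality."""
    rhs = w(abs(beta_star_j) - u) + abs(beta_star_j - beta_j) / u
    return rhs - w(abs(beta_j))


def min_gap(trials=10000, seed=0):
    """Smallest gap over random draws; a non-negative value supports the inequality."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        b = rng.uniform(-6.0, 6.0)
        bs = rng.uniform(-6.0, 6.0)
        u = rng.uniform(0.1, 3.0)
        worst = min(worst, lemma_5_2_gap(b, bs, u))
    return worst
```

A random search of this kind is only a sanity check, not a proof; the two-case argument above is what establishes the bound.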


B.3. Statistical Theory under LRE Condition. In this section, 
we present the main theorem and its proof, with some technical lemmas 
postponed to later sections. We formally introduce the LRE condition below. 

Assumption B.1. There exist $k \le 2s$, $\gamma$ and $r > \lambda\sqrt{s}$ such that $0 < \kappa_* \le \kappa_-(k, \gamma, r) \le \kappa_+(k, \gamma, r) \le \kappa^* < \infty$.

B.3.1. Proofs of Main Theorems. We begin with a proposition, which 
establishes the relationship between localized restricted eigenvalue and the 
localized version of the restricted strong convexity/smoothness. The proof 
is similar to that of Proposition B.3, and thus is omitted. 

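Proposition B.4. For any $\beta_1, \beta_2 \in \mathcal{C}(k, \gamma, r) \cap B_2(r/2, \beta^*)$, we have

$$\tfrac{1}{2}\kappa_-(k, \gamma, r)\|\beta_1 - \beta_2\|_2^2 \le D_{\mathcal{L}}(\beta_1, \beta_2) \le \tfrac{1}{2}\kappa_+(k, \gamma, r)\|\beta_1 - \beta_2\|_2^2 \quad\text{and}$$

$$\kappa_-(k, \gamma, r)\|\beta_1 - \beta_2\|_2^2 \le D_{\mathcal{L}}^{s}(\beta_1, \beta_2) \le \kappa_+(k, \gamma, r)\|\beta_1 - \beta_2\|_2^2.$$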

Next, we bound the $\ell_2$ error using the regularization parameter. The proof is similar to that of Lemma 5.1 and depends on Lemma B.7, where we introduce the localized analysis such that the localized restricted eigenvalue condition can be applied.

Lemma B.5. Suppose that Assumption 4.2 holds with $u = 18\lambda/\kappa_*$, and Assumption B.1 holds with $k \le 2s$, $\gamma = 5$. If $\lambda$, $\epsilon$ and $r$ satisfy

$$4(\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon) \le \lambda \le r\kappa_*/(2\sqrt{s}),$$

then $|\mathcal{E}_\ell| \le 2s$ and any $\epsilon$-optimal solution $\widetilde{\beta}^{(\ell)}$ ($\ell \ge 1$) must satisfy

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le \kappa_*^{-1}\Big(\|\lambda_S^{(\ell-1)}\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}_\ell}\|_2 + \epsilon\sqrt{|\mathcal{E}_\ell|}\Big) \lesssim \lambda\sqrt{s}.$$

Recall Lemma 5.2, which bounds the regularization parameter using a functional of the estimator from the previous step. Combining these two lemmas, we obtain the following main theorem.

Theorem B.6 (Optimal Statistical Rate under Localized Restricted Eigenvalue Condition). Suppose the same conditions of Lemma B.5 hold, but with $\epsilon$ replaced by $\epsilon_c \vee \epsilon_t$. Then, for $\ell \ge 2$ and some constant $C$, any $\epsilon_t$-optimal solution $\widetilde{\beta}^{(\ell)}$ must satisfy

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le C\Big(\underbrace{\|\nabla\mathcal{L}(\beta^*)_S\|_2}_{\text{oracle rate}} + \overbrace{\epsilon_t\sqrt{s}}^{\text{opt err}} + \underbrace{\lambda\|w_S(|\beta_S^*| - u)\|_2}_{\text{coefficient effect}}\Big) + \overbrace{C\delta^{\ell-1}\lambda\sqrt{s}}^{\text{tightening effect}}.$$

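Proof of Theorem B.6. Under the conditions of the theorem, Lemma B.5 directly implies $|\mathcal{E}_\ell| \le 2s$ and

$$\|\lambda_{\mathcal{E}_\ell^c}^{(\ell-1)}\|_{\min} \ge \|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon, \quad\text{for all } \ell \ge 1,$$

where $\epsilon = \epsilon_c \vee \epsilon_t$. Lemma B.7 then gives

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le \kappa_*^{-1}\Big(\|\lambda_S^{(\ell-1)}\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}_\ell}\|_2 + \epsilon\sqrt{|\mathcal{E}_\ell|}\Big). \tag{B.7}$$

To bound the first term in the inequality above, we apply Lemma 5.2 and obtain

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le \frac{1}{\kappa_*}\Big(\underbrace{\|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}_\ell}\|_2 + \epsilon_t\sqrt{|\mathcal{E}_\ell|}}_{\mathrm{I}} + \lambda\|w_S(|\beta_S^*| - u)\|_2\Big) + \frac{\lambda}{u\kappa_*}\|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2. \tag{B.8}$$

Following a similar argument to the proof of Theorem 4.2, the term I can be bounded by

$$\|\nabla\mathcal{L}(\beta^*)_S\|_2 + \epsilon_t\sqrt{s} + \frac{\lambda}{u}\|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2,$$

which, combined with (B.8), yields that

$$\|\widetilde{\beta}^{(\ell)} - \beta^*\|_2 \le \kappa_*^{-1}\Big(\|\nabla\mathcal{L}(\beta^*)_S\|_2 + \epsilon_t\sqrt{s} + \lambda\|w_S(|\beta_S^*| - u)\|_2\Big) + \delta\|\widetilde{\beta}^{(\ell-1)} - \beta^*\|_2,$$

where $\delta = 2\lambda/(u\kappa_*) \le 1/2$. The proof is completed by applying the above inequality recursively. □

Fig 5. The localized analysis under the localized restricted eigenvalue condition: an intermediate solution, $\widetilde{\beta}^*$, is constructed such that $\|\widetilde{\beta}^* - \beta^*\|_2 = r$ if $\|\widetilde{\beta} - \beta^*\|_2 > r$, and $\widetilde{\beta}^* = \widetilde{\beta}$ otherwise. Employing the localized eigenvalue condition, this intermediate solution converges at the rate of $\lambda\sqrt{s}$ in the contraction stage and $\sqrt{s/n}$ in the tightening stage. Under mild conditions, the approximate solution $\widetilde{\beta} = \widetilde{\beta}^*$ and thus has the same convergence rate as the intermediate solution. The shaded area is a local $\ell_1$-cone where the restricted eigenvalue condition holds.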

B.3.2. Localized Analysis. In this section, we carry out the localized analysis. A geometric explanation of the localized analysis is given in Figure 5. The following lemma shows, by a novel localized analysis, that the approximate solution always falls in a neighborhood of $\beta^*$, and thus the localized restricted eigenvalue condition can be exploited. Recall the definitions of $D_{\mathcal{L}}(\cdot, \cdot)$, $D_{\mathcal{L}}^{s}(\cdot, \cdot)$ and $\mathcal{C}(k, \gamma, r)$ in Section B.2.

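Lemma B.7. Suppose Assumption B.1 holds. Take $\mathcal{E}$ such that $S \cap \mathcal{E}^c = \emptyset$, $|\mathcal{E}| \le k \le 2s$. Further assume that $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon$ and $2\kappa_*^{-1}\lambda\sqrt{s} \le r$. Then any $\epsilon$-optimal solution $\widetilde{\beta}$ must satisfy

$$\|\widetilde{\beta} - \beta^*\|_2 \le \kappa_*^{-1}\Big(\|\lambda_S\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \epsilon\sqrt{|\mathcal{E}|}\Big).$$

Proof of Lemma B.7. Let $\widetilde{\beta}^* = \beta^* + t(\widetilde{\beta} - \beta^*)$, where $t = 1$ if $\|\widetilde{\beta} - \beta^*\|_2 \le r$, and $t \in (0, 1)$ is such that $\|\widetilde{\beta}^* - \beta^*\|_2 = r$ otherwise. By the definition of $\widetilde{\beta}^*$, we know that $\|\widetilde{\beta}^* - \beta^*\|_2 \le r$. Using the $\ell_1$-cone lemma, we know that the approximate solution falls in the $\ell_1$-cone, i.e.

$$\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1 \le 5\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_1. \tag{B.9}$$

From the construction of $\widetilde{\beta}^*$, we know $\widetilde{\beta}^* - \beta^* = t(\widetilde{\beta} - \beta^*)$. Thus, we have

$$\|(\widetilde{\beta}^* - \beta^*)_{\mathcal{E}^c}\|_1 \le 5\|(\widetilde{\beta}^* - \beta^*)_{\mathcal{E}}\|_1.$$

Combining the inequality above with the assumption $|\mathcal{E}| \le k$ shows that $\widetilde{\beta}^*$ falls in the local $\ell_1$-cone, i.e. $\widetilde{\beta}^* \in \mathcal{C}(k, 5, r)$. Then Proposition B.4 implies the localized restricted strong convexity, i.e.

$$\kappa_-(k, 5, r)\|\widetilde{\beta}^* - \beta^*\|_2^2 \le D_{\mathcal{L}}^{s}(\widetilde{\beta}^*, \beta^*). \tag{B.10}$$

To bound the right hand side of the above inequality, we use Lemma F.2 in Appendix F:

$$D_{\mathcal{L}}^{s}(\widetilde{\beta}^*, \beta^*) \le tD_{\mathcal{L}}^{s}(\widetilde{\beta}, \beta^*) = t\langle\nabla\mathcal{L}(\widetilde{\beta}) - \nabla\mathcal{L}(\beta^*), \widetilde{\beta} - \beta^*\rangle. \tag{B.11}$$

It suffices to bound the right hand side of (B.11). Plugging (B.11) back into (B.10) and adding $t\langle\lambda \odot \xi, \widetilde{\beta} - \beta^*\rangle$ to both sides, we obtain

$$\kappa_-(k, 5, r)\|\widetilde{\beta}^* - \beta^*\|_2^2 + t\underbrace{\langle\nabla\mathcal{L}(\beta^*), \widetilde{\beta} - \beta^*\rangle}_{\mathrm{I}} + t\underbrace{\langle\lambda \odot \xi, \widetilde{\beta} - \beta^*\rangle}_{\mathrm{II}} \le t\underbrace{\langle\nabla\mathcal{L}(\widetilde{\beta}) + \lambda \odot \xi, \widetilde{\beta} - \beta^*\rangle}_{\mathrm{III}}. \tag{B.12}$$

It remains to bound terms I, II and III respectively. For I, separating the support of $\nabla\mathcal{L}(\beta^*)$ and $\widetilde{\beta} - \beta^*$ into $\mathcal{E}$ and $\mathcal{E}^c$ and using the Hölder inequality, we obtain

$$\mathrm{I} = \langle(\nabla\mathcal{L}(\beta^*))_{\mathcal{E}}, (\widetilde{\beta} - \beta^*)_{\mathcal{E}}\rangle + \langle(\nabla\mathcal{L}(\beta^*))_{\mathcal{E}^c}, (\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\rangle \ge -\|(\nabla\mathcal{L}(\beta^*))_{\mathcal{E}}\|_2\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2 - \|(\nabla\mathcal{L}(\beta^*))_{\mathcal{E}^c}\|_\infty\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1. \tag{B.13}$$

For II, separating the support of $\lambda \odot \xi$ and $\widetilde{\beta} - \beta^*$ into $S$, $\mathcal{E} \setminus S$ and $\mathcal{E}^c$ gives

$$\mathrm{II} = \langle(\lambda \odot \xi)_S, (\widetilde{\beta} - \beta^*)_S\rangle + \langle(\lambda \odot \xi)_{\mathcal{E} \setminus S}, (\widetilde{\beta} - \beta^*)_{\mathcal{E} \setminus S}\rangle + \langle(\lambda \odot \xi)_{\mathcal{E}^c}, (\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\rangle.$$

To bound the last term in the above equality, note that $\mathcal{E}^c \cap S = \emptyset$ and thus

$$\langle(\lambda \odot \xi)_{\mathcal{E}^c}, (\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\rangle = \langle\lambda_{\mathcal{E}^c}, |\widetilde{\beta}_{\mathcal{E}^c}|\rangle = \langle\lambda_{\mathcal{E}^c}, |(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}|\rangle,$$

which yields that

$$\mathrm{II} = \langle(\lambda \odot \xi)_S, (\widetilde{\beta} - \beta^*)_S\rangle + \langle(\lambda \odot \xi)_{\mathcal{E} \setminus S}, (\widetilde{\beta} - \beta^*)_{\mathcal{E} \setminus S}\rangle + \langle\lambda_{\mathcal{E}^c}, |(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}|\rangle$$
$$\ge -\|\lambda_S\|_2\|(\widetilde{\beta} - \beta^*)_S\|_2 + \langle\lambda_{\mathcal{E}^c}, |(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}|\rangle \ge -\|\lambda_S\|_2\|(\widetilde{\beta} - \beta^*)_S\|_2 + \|\lambda_{\mathcal{E}^c}\|_{\min}\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1, \tag{B.14}$$

where the first inequality is due to the fact that $\langle(\lambda \odot \xi)_{\mathcal{E} \setminus S}, (\widetilde{\beta} - \beta^*)_{\mathcal{E} \setminus S}\rangle \ge 0$, and we use the Hölder inequality in the last inequality. For III, we first write $u = \nabla\mathcal{L}(\widetilde{\beta}) + \lambda \odot \xi$. Using similar arguments, we obtain

$$\langle\nabla\mathcal{L}(\widetilde{\beta}) + \lambda \odot \xi, \widetilde{\beta} - \beta^*\rangle = \langle u_{\mathcal{E}}, (\widetilde{\beta} - \beta^*)_{\mathcal{E}}\rangle + \langle u_{\mathcal{E}^c}, (\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\rangle \le \|u_{\mathcal{E}}\|_2\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2 + \|u_{\mathcal{E}^c}\|_\infty\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1. \tag{B.15}$$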

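Plugging (B.13), (B.14), (B.15) into (B.12) and taking the infimum over $\xi \in \partial\|\widetilde{\beta}\|_1$, we obtain

$$\kappa_-(k, 5, r)\|\widetilde{\beta}^* - \beta^*\|_2^2 + t\big(\|\lambda_{\mathcal{E}^c}\|_{\min} - \|\nabla\mathcal{L}(\beta^*)\|_\infty\big)\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1 - t\big(\|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \|\lambda_S\|_2\big)\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2$$
$$\le t\inf_{\xi \in \partial\|\widetilde{\beta}\|_1}\|u_{\mathcal{E}}\|_2\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2 + t\inf_{\xi \in \partial\|\widetilde{\beta}\|_1}\|u_{\mathcal{E}^c}\|_\infty\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1 \le t\epsilon\sqrt{|\mathcal{E}|}\,\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2 + t\epsilon\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1,$$

where we use the fact that $\inf_{\xi}\|u_{\mathcal{E}}\|_2 \le \inf_{\xi}\sqrt{|\mathcal{E}|}\,\|u_{\mathcal{E}}\|_\infty \le \epsilon\sqrt{|\mathcal{E}|}$ in the last inequality. After some algebra, we obtain

$$\kappa_-(k, 5, r)\|\widetilde{\beta}^* - \beta^*\|_2^2 + t\big(\|\lambda_{\mathcal{E}^c}\|_{\min} - (\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon)\big)\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1 \le t\Big(\|\lambda_S\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \epsilon\sqrt{|\mathcal{E}|}\Big)\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2.$$

Using the assumption that $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon$, the inequality above can be simplified to

$$\kappa_-(k, 5, r)\|\widetilde{\beta}^* - \beta^*\|_2^2 \le \underbrace{\Big(\|\lambda_S\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \epsilon\sqrt{|\mathcal{E}|}\Big)}_{\mathrm{(i)}} \times \underbrace{t\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2}_{\mathrm{(ii)}}. \tag{B.16}$$

For (i), using the fact that $\|\lambda_S\|_2 \le \|\lambda_S\|_\infty\sqrt{|S|} \le \lambda\sqrt{s}$, we have

$$\mathrm{(i)} \le \lambda\sqrt{s} + \big(\|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_\infty + \epsilon\big)\sqrt{|\mathcal{E}|} \le \lambda\sqrt{s} + \frac{\lambda}{2}\sqrt{k},$$

where, in the first inequality, we have used

$$\|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \epsilon\sqrt{|\mathcal{E}|} \le \big(\|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_\infty + \epsilon\big)\sqrt{|\mathcal{E}|} \le \frac{\lambda}{2}\sqrt{|\mathcal{E}|}.$$

For (ii), we have $t\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_2 = \|(\widetilde{\beta}^* - \beta^*)_{\mathcal{E}}\|_2 \le \|\widetilde{\beta}^* - \beta^*\|_2$. Plugging the bounds for (i) and (ii) into (B.16) and using the assumption $2\kappa_*^{-1}\lambda\sqrt{s} \le r$, we obtain

$$\|\widetilde{\beta}^* - \beta^*\|_2 \le \frac{1 + \sqrt{2}/2}{\kappa_*}\lambda\sqrt{s} < r,$$

which, when $t < 1$, contradicts the construction of $\widetilde{\beta}^*$. This indicates that $\widetilde{\beta}^* = \widetilde{\beta}$. Therefore, the desired bound holds for $\widetilde{\beta}$. □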
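The intermediate solution used in the localized analysis is a radial shrinkage of the approximate solution toward $\beta^*$. A minimal sketch of the construction, in plain Python (the function name `intermediate_solution` is hypothetical and the vectors are plain lists):

```python
import math


def intermediate_solution(beta_tilde, beta_star, r):
    """Return (point, t) with point = beta_star + t * (beta_tilde - beta_star).

    t = 1 when beta_tilde already lies in the l2-ball of radius r around
    beta_star; otherwise t in (0, 1) places the point on the sphere of radius r.
    """
    diff = [a - b for a, b in zip(beta_tilde, beta_star)]
    dist = math.sqrt(sum(d * d for d in diff))
    t = 1.0 if dist <= r else r / dist
    return [b + t * d for b, d in zip(beta_star, diff)], t
```

The proof then rules out the case $t < 1$: the derived bound is strictly smaller than $r$, while $t < 1$ would force the distance to equal $r$ exactly.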


APPENDIX C: COMPUTATIONAL THEORY 

In this section, we collect proofs for Proposition 4.5, Proposition 4.6 and Theorem 4.7. We then give the proofs for Lemma 5.3, Lemma 5.4 and Lemma 5.5. The proofs of technical lemmas are postponed to later sections. We denote the quadratic coefficient $\phi$ by $\phi_c$ in the contraction stage, and by $\phi_t$ in the tightening stage.


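C.1. Proofs of Main Results. We start with the contraction stage and give the proof of Proposition 4.5, followed by that of Proposition 4.6.

Proof of Proposition 4.5. We omit the superscript $\ell$ in $\beta^{(\ell,k)}$, $1$ in $\widehat{\beta}^{(1)}$, and $0$ in $\lambda^{(0)}$ for simplicity. Applying Lemma E.5 gives

$$\omega_\lambda(\beta^{(k+1)}) \le (\phi_c + \rho_c)\|\beta^{(k+1)} - \beta^{(k)}\|_2. \tag{C.1}$$

On the other hand, taking $\beta = \beta^{(k)}$ in Lemma E.4, we obtain

$$F(\beta^{(k)}, \lambda^{(0)}) - F(\beta^{(k+1)}, \lambda^{(0)}) \ge \frac{\phi_c}{2}\|\beta^{(k+1)} - \beta^{(k)}\|_2^2.$$

Plugging this inequality back into (C.1), we obtain a bound for the suboptimality measure

$$\omega_\lambda(\beta^{(k+1)}) \le (\phi_c + \rho_c)\Big\{\frac{2}{\phi_c}\big[F(\beta^{(k)}, \lambda) - F(\beta^{(k+1)}, \lambda)\big]\Big\}^{1/2}. \tag{C.2}$$

Since $\{F(\beta^{(k)}, \lambda)\}_{k=0}^{\infty}$ is a monotone decreasing sequence, we have

$$F(\widehat{\beta}, \lambda) \le \ldots \le F(\beta^{(k)}, \lambda) \le \ldots \le F(\beta^{(0)}, \lambda). \tag{C.3}$$

Plugging (C.3) back into (C.2) and using Lemma 5.3, we obtain

$$\omega_\lambda(\beta^{(k+1)}) \le (\phi_c + \rho_c)\Big\{\frac{2}{\phi_c}\big[F(\beta^{(k)}, \lambda) - F(\widehat{\beta}, \lambda)\big]\Big\}^{1/2} \le \frac{\phi_c + \rho_c}{\sqrt{k}}\|\widehat{\beta}\|_2, \tag{C.4}$$

where we have used the fact that $\beta^{(0)} = 0$. To further simplify the above bound, we observe that $\phi_c \le \gamma_u\rho_c$. Using the triangle inequality, we have

$$\omega_\lambda(\beta^{(k+1)}) \le \frac{(1 + \gamma_u)\rho_c}{\sqrt{k}}\|\widehat{\beta}\|_2 \le \frac{(1 + \gamma_u)\rho_c}{\sqrt{k}}\big(\|\beta^*\|_2 + \|\widehat{\beta} - \beta^*\|_2\big).$$

Taking $\ell = 1$ and $\epsilon = 0$ in Lemma 5.1, we have $\|\widehat{\beta} - \beta^*\|_2 \le 18\lambda\sqrt{s}$. Plugging this back into (C.4) yields that

$$\omega_\lambda(\beta^{(k+1)}) \le \frac{(1 + \gamma_u)\rho_c}{\sqrt{k}}\big\{\|\beta^*\|_2 + 18\lambda\sqrt{s}\big\} \le \frac{(1 + \gamma_u)R\rho_c}{2\sqrt{k}},$$

where $R = 2(\|\beta^*\|_2 + 18\lambda\sqrt{s}) \lesssim \|\beta^*\|_2 + \lambda\sqrt{s}$. Therefore, in the contraction stage, to ensure that $\omega_{\lambda^{(0)}}(\beta^{(k+1)}) \le \epsilon_c$, it suffices to choose $k$ such that

$$\frac{(1 + \gamma_u)R\rho_c}{2\sqrt{k}} \le \epsilon_c, \quad\text{which implies}\quad k \ge \Big(\frac{(1 + \gamma_u)R\rho_c}{2\epsilon_c}\Big)^2.$$

□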


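Proof of Proposition 4.6. Write $\epsilon_t$ as $\epsilon$ and assume $\ell \ge 2$. Applying Lemma E.8 in the supplement, we obtain

$$\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,k+1)}) \le \big(\rho_+(2s + 2\widetilde{s}, r) + \phi_t\big)\|\beta^{(\ell,k+1)} - \beta^{(\ell,k)}\|_2,$$

which, combined with Lemma E.9, yields

$$\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,k+1)}) \le (\phi_t + \rho_+)\sqrt{\frac{2}{\phi_t}\Big(F(\beta^{(\ell,k)}, \lambda^{(\ell-1)}) - F(\beta^{(\ell,k+1)}, \lambda^{(\ell-1)})\Big)} \le (1 + \kappa)\sqrt{2\gamma_u\rho_+\Big(F(\beta^{(\ell,k)}, \lambda^{(\ell-1)}) - F(\beta^{(\ell,k+1)}, \lambda^{(\ell-1)})\Big)},$$

where we use $\rho_- \le \phi_t \le \gamma_u\rho_+$ in the last inequality. Since the sequence $\{F(\beta^{(\ell,k)}, \lambda^{(\ell-1)})\}_{k=0}^{\infty}$ decreases monotonically, we obtain

$$\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,k+1)}) \le (1 + \kappa)\sqrt{2\gamma_u\rho_+\Big(F(\beta^{(\ell,k)}, \lambda^{(\ell-1)}) - F(\widehat{\beta}^{(\ell)}, \lambda^{(\ell-1)})\Big)}$$
$$\le (1 + \kappa)\sqrt{2\gamma_u\rho_+\Big(1 - \frac{1}{4\gamma_u\kappa}\Big)^{k}\Big(F(\beta^{(\ell,0)}, \lambda^{(\ell-1)}) - F(\widehat{\beta}^{(\ell)}, \lambda^{(\ell-1)})\Big)} \le (1 + \kappa)\sqrt{2C\gamma_u\rho_+}\Big(1 - \frac{1}{4\gamma_u\kappa}\Big)^{k/2}\lambda\sqrt{s},$$

where the second inequality is due to Lemma 5.5, and the last one is due to Lemma E.14. Here $C$ is some positive constant. Therefore, for $\ell \ge 2$, to ensure that $\beta^{(\ell,k+1)}$ satisfies $\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,k+1)}) \le \epsilon$, it suffices to choose $k$ such that

$$(1 + \kappa)\sqrt{2C\gamma_u\rho_+}\Big(1 - \frac{1}{4\gamma_u\kappa}\Big)^{k/2}\lambda\sqrt{s} \le \epsilon.$$

Equivalently, we obtain

$$k \ge C'\log\Big(C''\frac{\lambda\sqrt{s}}{\epsilon}\Big),$$

where $C' = 2/\log\{4\gamma_u\kappa/(4\gamma_u\kappa - 1)\}$ and $C'' = 2(1 + \kappa)\sqrt{C\gamma_u\rho_+}$. □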
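To make the phase transition concrete, the two iteration bounds can be evaluated side by side: the contraction-stage count grows like $1/\epsilon_c^2$, while the tightening-stage count grows like $\log(1/\epsilon_t)$. The sketch below is illustrative only. The function names are hypothetical, the constant `C2` stands in for $C''$, and the contraction factor $1 - 1/(4\gamma_u\kappa)$ is an assumption taken from the argument of Lemma 5.5.

```python
import math


def contraction_iterations(gamma_u, R, rho_c, eps_c):
    """Smallest integer k with k >= ((1 + gamma_u) * R * rho_c / (2 * eps_c))^2."""
    return math.ceil(((1.0 + gamma_u) * R * rho_c / (2.0 * eps_c)) ** 2)


def tightening_iterations(gamma_u, kappa, C2, lam, s, eps_t):
    """Smallest integer k with k >= C' * log(C2 * lam * sqrt(s) / eps_t).

    C' = 2 / log(4 * gamma_u * kappa / (4 * gamma_u * kappa - 1)) is assumed.
    """
    ratio = 4.0 * gamma_u * kappa
    c_prime = 2.0 / math.log(ratio / (ratio - 1.0))
    return max(0, math.ceil(c_prime * math.log(C2 * lam * math.sqrt(s) / eps_t)))
```

Halving `eps_c` quadruples the first count, whereas halving `eps_t` adds only about $C'\log 2$ iterations to the second.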

C.2. Key Lemmas. In this section, we give the proofs of the key lemmas in Section 5.2. We start with the proof of Lemma 5.3, followed by the proofs of Lemma 5.4 and Lemma 5.5.


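Proof of Lemma 5.3. For simplicity, we omit the superscript $\ell$ in $(\ell, k)$ and denote $\widehat{\beta}^{(1)}$, $\lambda^{(0)}$ as $\widehat{\beta}$ and $\lambda$ respectively. Taking $\beta = \widehat{\beta}$ in Lemma E.4 and simplifying the inequality, we have

$$F(\widehat{\beta}, \lambda) - F(\beta^{(j)}, \lambda) \ge \frac{\phi_c}{2}\Big\{\|\beta^{(j)} - \beta^{(j-1)}\|_2^2 - 2\langle\widehat{\beta} - \beta^{(j-1)}, \beta^{(j)} - \beta^{(j-1)}\rangle\Big\} = \frac{\phi_c}{2}\Big\{\|\widehat{\beta} - \beta^{(j)}\|_2^2 - \|\widehat{\beta} - \beta^{(j-1)}\|_2^2\Big\}. \tag{C.5}$$

Multiplying both sides of (C.5) by $2/\phi_c$ and summing over $j$ gives

$$\frac{2}{\phi_c}\sum_{j=1}^{k}\big\{F(\widehat{\beta}, \lambda) - F(\beta^{(j)}, \lambda)\big\} \ge \sum_{j=1}^{k}\Big\{\|\widehat{\beta} - \beta^{(j)}\|_2^2 - \|\widehat{\beta} - \beta^{(j-1)}\|_2^2\Big\},$$

or equivalently

$$\frac{2}{\phi_c}\Big\{kF(\widehat{\beta}, \lambda) - \sum_{j=1}^{k}F(\beta^{(j)}, \lambda)\Big\} \ge \|\beta^{(k)} - \widehat{\beta}\|_2^2 - \|\beta^{(0)} - \widehat{\beta}\|_2^2. \tag{C.6}$$

On the other hand, taking $\beta = \beta^{(k-1)}$ in Lemma E.4 and replacing $k$ with $j$ yields

$$F(\beta^{(j-1)}, \lambda) - F(\beta^{(j)}, \lambda) \ge \frac{\phi_c}{2}\|\beta^{(j)} - \beta^{(j-1)}\|_2^2.$$

Multiplying both sides of the inequality above by $j - 1$ and summing over $j$, we obtain

$$\frac{2}{\phi_c}\sum_{j=1}^{k}\big\{(j - 1)F(\beta^{(j-1)}, \lambda) - jF(\beta^{(j)}, \lambda) + F(\beta^{(j)}, \lambda)\big\} \ge \sum_{j=1}^{k}(j - 1)\|\beta^{(j)} - \beta^{(j-1)}\|_2^2,$$

or equivalently

$$\frac{2}{\phi_c}\Big\{-kF(\beta^{(k)}, \lambda) + \sum_{j=1}^{k}F(\beta^{(j)}, \lambda)\Big\} \ge \sum_{j=1}^{k}(j - 1)\|\beta^{(j)} - \beta^{(j-1)}\|_2^2. \tag{C.7}$$

Adding (C.6) and (C.7) together and canceling the term $(2/\phi_c)\sum_{j=1}^{k}F(\beta^{(j)}, \lambda)$, we obtain that

$$\frac{2k}{\phi_c}\big\{F(\widehat{\beta}, \lambda) - F(\beta^{(k)}, \lambda)\big\} \ge \|\beta^{(k)} - \widehat{\beta}\|_2^2 + \sum_{j=1}^{k}(j - 1)\|\beta^{(j)} - \beta^{(j-1)}\|_2^2 - \|\beta^{(0)} - \widehat{\beta}\|_2^2 \ge -\|\beta^{(0)} - \widehat{\beta}\|_2^2,$$

which, after multiplying both sides by $-1$, yields that

$$\frac{2k}{\phi_c}\big\{F(\beta^{(k)}, \lambda) - F(\widehat{\beta}, \lambda)\big\} \le \|\beta^{(0)} - \widehat{\beta}\|_2^2.$$

Therefore, the proof is completed. □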

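Proof of Lemma 5.4. For simplicity, we omit the superscript. Define the active set $S_n$ as $\{j : |\nabla\mathcal{L}(\widehat{\beta})_j| = \lambda\}$. Then we must have $\{j : \widehat{\beta}_j \ne 0\} \subseteq S_n$. It suffices to show $|S_n| \le s + \widetilde{s}$. To achieve this goal, we decompose $S_n$ into two parts and bound the size of each separately. Specifically, let

$$S_n \subseteq S \cup \underbrace{\{j \in S^c : |(\nabla\mathcal{L}(\widehat{\beta}) - \nabla\mathcal{L}(\beta^*))_j| \ge \lambda/2\}}_{S^1} \cup \underbrace{\{j \in S^c : |\nabla\mathcal{L}(\beta^*)_j| > \lambda/2\}}_{S^2}.$$

For $S^2$, the assumption that $\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon \le \lambda/4$ implies $S^2 = \emptyset$ and thus $|S^2| = 0$. For $S^1$, consider $S'$ with maximum size $s' = |S'| \le \widetilde{s}$ such that $S' \subseteq S^1$. Then there exists a $d$-dimensional sign vector $u$ satisfying $\|u\|_\infty = 1$ and $\|u\|_0 = s'$, such that

$$\lambda s'/2 \le u^{T}\big(\nabla\mathcal{L}(\widehat{\beta}) - \nabla\mathcal{L}(\beta^*)\big).$$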








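Then, by the mean value theorem, there exists some $\gamma \in [0, 1]$ such that

$$\nabla\mathcal{L}(\widehat{\beta}) - \nabla\mathcal{L}(\beta^*) = \nabla^2\mathcal{L}\big(\gamma\widehat{\beta} + (1 - \gamma)\beta^*\big)(\widehat{\beta} - \beta^*) = H(\widehat{\beta} - \beta^*).$$

Here $H = \nabla^2\mathcal{L}(\gamma\widehat{\beta} + (1 - \gamma)\beta^*)$. Writing $u^{T}H(\widehat{\beta} - \beta^*)$ as $\langle H^{1/2}u, H^{1/2}(\widehat{\beta} - \beta^*)\rangle$ and applying the Cauchy-Schwarz inequality, we have

$$\lambda s'/2 \le \langle H^{1/2}u, H^{1/2}(\widehat{\beta} - \beta^*)\rangle \le \underbrace{\|H^{1/2}u\|_2}_{\mathrm{I}}\,\underbrace{\|H^{1/2}(\widehat{\beta} - \beta^*)\|_2}_{\mathrm{II}}. \tag{C.8}$$

Now we bound terms I and II respectively. Since $\widehat{\beta}, \beta^* \in B_2(r, \beta^*)$, any convex combination of $\widehat{\beta}$ and $\beta^*$ also falls in $B_2(r, \beta^*)$. The localized sparse eigenvalue condition can therefore be used on $H$. For I, it follows from Definition 4.1 that

$$\|H^{1/2}u\|_2 \le \sqrt{\rho_+(s', r)}\,\|u\|_2 \le \sqrt{\rho_+(s', r)}\,\{\|u\|_1\|u\|_\infty\}^{1/2} \le \sqrt{\rho_+(s', r)}\sqrt{s'}.$$

For II, write $C$ for the constant, depending on $\rho_*$, in Lemma E.6 in the supplement. It follows from that lemma that the following inequality holds:

$$\|H^{1/2}(\widehat{\beta} - \beta^*)\|_2^2 = \langle\nabla\mathcal{L}(\widehat{\beta}) - \nabla\mathcal{L}(\beta^*), \widehat{\beta} - \beta^*\rangle \le C\lambda^2 s.$$

Thus by plugging the bounds for I and II back into (C.8), we obtain

$$\lambda s'/2 \le \sqrt{\rho_+(s', r)}\sqrt{s'} \times \sqrt{C}\lambda\sqrt{s}.$$

Dividing both sides of the above inequality by $\lambda\sqrt{s'}/2$ and taking squares gives

$$s' \le 4C\rho_+(s', r)s \le 4C\rho_+(\widetilde{s}, r)s < \widetilde{s}, \tag{C.9}$$

where the last inequality is due to the assumption. Because $s' = |S'|$ achieves the maximum possible value such that $s' \le \widetilde{s}$ for any subset $S'$ of $S^1$ and (C.9) shows that $s' < \widetilde{s}$, we must have $S' = S^1$, and thus

$$|S^1| = s' \le \lfloor 4C\rho_+(\widetilde{s}, r)s\rfloor < \widetilde{s}.$$

This proves the desired result. □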

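Proof of Lemma 5.5. For notational simplicity, we omit the tightening step index $\ell$ in $\beta^{(\ell,k)}$ and $\mathcal{E}_\ell$, and write $\beta^{(\ell,k)}$, $\lambda^{(\ell-1)}$, $\mathcal{E}_\ell$ as $\beta^{(k)}$, $\lambda$ and $\mathcal{E}$ respectively. Define $\beta(\alpha) = \alpha\widehat{\beta} + (1 - \alpha)\beta^{(k-1)}$. Since $F(\beta^{(k)}, \lambda)$ is majorized at $\beta^{(k-1)}$, we have

$$F(\beta^{(k)}, \lambda) \le \min_{\beta}\Big\{\mathcal{L}(\beta^{(k-1)}) + \langle\nabla\mathcal{L}(\beta^{(k-1)}), \beta - \beta^{(k-1)}\rangle + \frac{\phi_t}{2}\|\beta - \beta^{(k-1)}\|_2^2 + \|\lambda \odot \beta\|_1\Big\} \le \min_{\alpha \in [0,1]}\Big\{F(\beta(\alpha), \lambda) + \frac{\phi_t}{2}\|\beta(\alpha) - \beta^{(k-1)}\|_2^2\Big\},$$

where we restrict $\beta$ to the line segment $\alpha\widehat{\beta} + (1 - \alpha)\beta^{(k-1)}$ and the last inequality follows from the convexity of $\mathcal{L}(\beta)$. Let $\widehat{\beta}^{(\ell)}$ be a solution to $\operatorname{argmin}_{\beta \in \mathbb{R}^d}\{\mathcal{L}(\beta) + \|\lambda^{(\ell-1)} \odot \beta\|_1\}$. Using the convexity of $F(\beta, \lambda)$, we obtain that

$$F(\beta^{(k)}, \lambda) \le \min_{\alpha}\Big\{F(\beta(\alpha), \lambda) + \frac{\alpha^2\phi_t}{2}\|\widehat{\beta} - \beta^{(k-1)}\|_2^2\Big\} \le \min_{\alpha}\Big\{\alpha F(\widehat{\beta}, \lambda) + (1 - \alpha)F(\beta^{(k-1)}, \lambda) + \frac{\alpha^2\phi_t}{2}\|\beta^{(k-1)} - \widehat{\beta}\|_2^2\Big\}$$
$$\le \min_{\alpha}\Big\{F(\beta^{(k-1)}, \lambda) - \alpha\big[F(\beta^{(k-1)}, \lambda) - F(\widehat{\beta}, \lambda)\big] + \frac{\alpha^2\phi_t}{2}\|\beta^{(k-1)} - \widehat{\beta}\|_2^2\Big\}.$$

Next, we bound the last term in the inequality above. Applying Lemma E.14 in the supplementary material, we obtain

$$\|(\beta^{(k-1)})_{\mathcal{E}^c}\|_0 \le \widetilde{s}, \quad \|\beta^{(k-1)} - \beta^*\|_2 \le C\lambda\sqrt{s} \le r, \quad \|\widehat{\beta} - \beta^*\|_2 \le r, \quad\text{and}\quad \|\widehat{\beta}_{\mathcal{E}^c}\|_0 \le \widetilde{s}.$$

Recall that $\xi$ is some subgradient of $\|\widehat{\beta}\|_1$. Using the convexity of $\mathcal{L}(\cdot)$ and the $\ell_1$-norm, $F(\beta^{(k-1)}, \lambda) - F(\widehat{\beta}, \lambda)$ can be bounded in the following way:

$$F(\beta^{(k-1)}, \lambda) - F(\widehat{\beta}, \lambda) \ge \langle\nabla\mathcal{L}(\widehat{\beta}) + \lambda \odot \xi, \beta^{(k-1)} - \widehat{\beta}\rangle + D_{\mathcal{L}}(\beta^{(k-1)}, \widehat{\beta}) \ge \frac{\rho_-}{2}\|\beta^{(k-1)} - \widehat{\beta}\|_2^2,$$

where the last inequality is due to the first order optimality condition and Proposition B.3. Thus we conclude that

$$F(\beta^{(k)}, \lambda) \le \min_{\alpha}\Big\{F(\beta^{(k-1)}, \lambda) - \alpha\big[F(\beta^{(k-1)}, \lambda) - F(\widehat{\beta}, \lambda)\big] + \frac{\alpha^2\phi_t}{\rho_-}\big[F(\beta^{(k-1)}, \lambda) - F(\widehat{\beta}, \lambda)\big]\Big\} \le F(\beta^{(k-1)}, \lambda) - \frac{\rho_-}{4\phi_t}\big[F(\beta^{(k-1)}, \lambda) - F(\widehat{\beta}, \lambda)\big],$$

which, combined with the fact that $\phi_t \le \gamma_u\rho_+$, yields

$$F(\beta^{(k)}, \lambda) - F(\widehat{\beta}, \lambda) \le \Big(1 - \frac{1}{4\gamma_u\kappa}\Big)^{k}\big\{F(\beta^{(0)}, \lambda) - F(\widehat{\beta}, \lambda)\big\},$$

in which $\kappa = \rho_+/\rho_-$.

□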
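The majorization step used in the proof of Lemma 5.5 has a closed-form minimizer: with an isotropic quadratic surrogate and a weighted $\ell_1$ penalty, each coordinate is a soft-thresholded gradient step. The following is a minimal sketch of one such update for a least squares loss $\mathcal{L}(\beta) = (2n)^{-1}\|y - X\beta\|_2^2$, inflating $\phi$ by a factor $\gamma_u$ until the surrogate majorizes the loss at the new point. It is a standard backtracking proximal-gradient step written for illustration, not the authors' implementation; the names `lamm_step`, `phi0` and the tolerance `1e-12` are assumptions.

```python
def soft_threshold(z, tau):
    """Minimizer over b of 0.5 * (b - z)^2 + tau * |b|."""
    if z > tau:
        return z - tau
    if z < -tau:
        return z + tau
    return 0.0


def ls_loss(beta, X, y):
    """Least squares loss (2n)^{-1} * ||y - X beta||_2^2."""
    res = [yi - sum(xij * bj for xij, bj in zip(xi, beta)) for xi, yi in zip(X, y)]
    return sum(r * r for r in res) / (2.0 * len(y))


def ls_grad(beta, X, y):
    """Gradient n^{-1} * X^T (X beta - y)."""
    n = len(y)
    res = [sum(xij * bj for xij, bj in zip(xi, beta)) - yi for xi, yi in zip(X, y)]
    return [sum(X[i][j] * res[i] for i in range(n)) / n for j in range(len(beta))]


def objective(beta, X, y, lam):
    """F(beta, lambda) = L(beta) + sum_j lambda_j * |beta_j|."""
    return ls_loss(beta, X, y) + sum(l * abs(b) for l, b in zip(lam, beta))


def lamm_step(beta_prev, X, y, lam, phi0=1e-3, gamma_u=2.0):
    """One update: inflate phi until the quadratic surrogate majorizes the loss."""
    grad = ls_grad(beta_prev, X, y)
    loss_prev = ls_loss(beta_prev, X, y)
    phi = phi0
    while True:
        beta_new = [soft_threshold(b - g / phi, l / phi)
                    for b, g, l in zip(beta_prev, grad, lam)]
        diff = [bn - bp for bn, bp in zip(beta_new, beta_prev)]
        surrogate = (loss_prev + sum(g * d for g, d in zip(grad, diff))
                     + 0.5 * phi * sum(d * d for d in diff))
        if ls_loss(beta_new, X, y) <= surrogate + 1e-12:
            return beta_new, phi
        phi *= gamma_u
```

Because the new iterate minimizes a surrogate that upper-bounds the objective and touches it at the previous iterate, the objective values produced by repeated calls are non-increasing, which is the monotonicity used throughout Appendix C.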
APPENDIX D: AN APPLICATION TO ROBUST LINEAR REGRESSION

In this section, we give an application of Theorem 4.2 to robust linear regression. The Huber loss, defined in Section A, is used to robustify against heavy-tailed errors. We allow the cutoff parameter $\alpha$ to scale with $(n, d, s)$ for a bias-robustness tradeoff. Let $y_i = x_i^{T}\beta^* + \epsilon_i$, $1 \le i \le n$, be independently and identically distributed random variables, with mean $\mu_i = x_i^{T}\beta^*$ and finite second moment $M$. Then, the following corollary suggests that, under only finite second moments, the sparse Huber estimator with an adaptive $\alpha$ can perform as well as the sparse ordinary least squares estimator, as if sub-Gaussian errors were assumed.

Corollary D.1. Suppose the same conditions as in Theorem 4.2 hold. Assume the columns of $X$ are normalized such that $\max_j\|X_{*j}\|_2 \le \sqrt{n}$. Assume there exists a $\gamma > 0$ such that $\|\beta_S^*\|_{\min} \ge u + \gamma\lambda$ and $w(\gamma\lambda) = 0$. If $\alpha \propto \lambda \propto \sqrt{n^{-1}\log d}$, $\epsilon_t \le \sqrt{1/n}$ and $T \gtrsim \log\log d$, then with probability at least $1 - 2d^{-\eta_1} - 2\exp(-\eta_2 s)$, $\widetilde{\beta}^{(T)}$ must satisfy

$$\|\widetilde{\beta}^{(T)} - \beta^*\|_2 \lesssim \sqrt{s/n},$$

where $\eta_1$ and $\eta_2$ are positive constants.

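Proof of Corollary D.1. The proof follows from that of Corollary 4.3 by bounding $\|\nabla\mathcal{L}(\beta^*)_S\|_\infty$ and the probability of the event $\{\|\nabla\mathcal{L}(\beta^*)\|_\infty \ge \lambda\}$. The derivative $\nabla\mathcal{L}(\beta^*)$ can be written as

$$\nabla\mathcal{L}(\beta^*) = -\frac{1}{n}\sum_{i=1}^{n}\ell_\alpha'(\epsilon_i)x_i = -\frac{1}{n}X^{T}\epsilon_\alpha,$$

where $\epsilon_\alpha = (\epsilon_{1,\alpha}, \ldots, \epsilon_{n,\alpha})^{T}$. Therefore, it suffices to show that $\epsilon_\alpha$ has a sub-Gaussian tail. Let $\psi(\alpha x) = 2^{-1}\alpha\ell_\alpha'(x)$. Then $\psi(x)$ satisfies

$$-\log(1 - x + x^2) \le \psi(x) \le \log(1 + x + x^2),$$

which yields that

$$\mathbb{E}[\exp\{\psi(\alpha\epsilon)\}] \le 1 + \alpha^2 M \quad\text{and}\quad \mathbb{E}[\exp\{-\psi(\alpha\epsilon)\}] \le 1 + \alpha^2 M.$$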

Using the Markov inequality, we obtain

$$\mathbb{P}\big(\psi(\alpha\epsilon) \ge Mt^2\big) \le \frac{\mathbb{E}[\exp\{\psi(\alpha\epsilon)\}]}{\exp(Mt^2)} \le (1 + \alpha^2 M)\exp\{-Mt^2\},$$

or equivalently

$$\mathbb{P}\big(\ell_\alpha'(\epsilon) \ge 2Mt^2/\alpha\big) \le \frac{\mathbb{E}[\exp\{\psi(\alpha\epsilon)\}]}{\exp(Mt^2)} \le (1 + \alpha^2 M)\exp\{-Mt^2\}.$$

Taking $\alpha = t/2$, we obtain

$$\mathbb{P}\big(\ell_\alpha'(\epsilon) \ge 4Mt\big) \le \frac{\mathbb{E}[\exp\{\psi(t\epsilon/2)\}]}{\exp\{Mt^2\}} \le (1 + Mt^2/4)\exp\{-Mt^2\} \le \exp\{-Mt^2/2\},$$

where the last inequality follows from the fact that $1 + Mt^2/4 \le \exp\{Mt^2/2\}$. The rest of the proof follows from that of Corollary 4.3. □
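With the Huber loss of Appendix A, the function $\psi$ in the proof above works out to a clipped identity: $\ell_\alpha'(x) = 2x$ for $|x| \le \alpha^{-1}$ and $2\alpha^{-1}\operatorname{sign}(x)$ otherwise, so $\psi(z) = \max(-1, \min(1, z))$. The sketch below checks the two-sided logarithmic bound on a grid. The function names and the grid are illustrative assumptions, and a grid check is a sanity check rather than a proof.

```python
import math


def huber_score(x, alpha):
    """Derivative of ell_alpha: 2x on |x| <= 1/alpha, constant magnitude 2/alpha beyond."""
    if abs(x) > 1.0 / alpha:
        return (2.0 / alpha) * (1.0 if x > 0 else -1.0)
    return 2.0 * x


def psi(z):
    """psi(z) = clip(z, -1, 1); equals (alpha / 2) * huber_score(x, alpha) at z = alpha * x."""
    return max(-1.0, min(1.0, z))


def bounds_hold(z, tol=1e-12):
    """Check -log(1 - z + z^2) <= psi(z) <= log(1 + z + z^2) up to rounding."""
    lower = -math.log(1.0 - z + z * z)
    upper = math.log(1.0 + z + z * z)
    return lower <= psi(z) + tol and psi(z) <= upper + tol
```

Exponentiating the upper bound gives $\exp\{\psi(\alpha\epsilon)\} \le 1 + \alpha\epsilon + \alpha^2\epsilon^2$, and taking expectations with a mean-zero error and second moment $M$ yields the stated bound $1 + \alpha^2 M$.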


APPENDIX E: TECHNICAL LEMMAS 

E.1. Statistical Theory. We collect the technical lemmas that are used to prove Theorem 4.2. We start by defining the following localized sparse relative covariance.

Definition E.1 (Localized Sparse Relative Covariance). The localized sparse relative covariance with parameter $r$ is defined as

$$\pi(i, j; \beta^*, r) = \sup_{v, u, \|\beta - \beta^*\|_2 \le r}\Big\{\frac{v_I^{T}\nabla^2\mathcal{L}(\beta)u_J/\|u_J\|_2}{v_I^{T}\nabla^2\mathcal{L}(\beta)v_I/\|v_I\|_2} : I \cap J = \emptyset, |I| \le i, |J| \le j\Big\}.$$

This is different from the restricted correlation defined in [4]. We measure the relative covariance between set $I$ and set $J$ with respect to that of set $I$. In the sequel, we omit the arguments $\beta^*, r$ in $\pi(i, j; \beta^*, r)$ for simplicity. Our next result bounds the localized sparse relative covariance in terms of sparse eigenvalues.

Lemma E.1. It holds that

$$\pi(i, j; \beta^*, r) \le \frac{1}{2}\sqrt{\frac{\rho_+(j, r)}{\rho_-(i + j, r)} - 1}.$$

We are then ready to bound the estimation error by functionals of the regularization parameter under the localized sparse eigenvalue condition, which is done in the following lemma.

Lemma E.2. Take $\mathcal{E}$ such that $S \subseteq \mathcal{E}$ and $|\mathcal{E}| = k \le 2s$. Let $J$ be the index set of the largest $m$ coefficients (in absolute value) in $\mathcal{E}^c$. Assume Assumption 4.1 holds, $\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon \le \lambda/4$, $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \lambda/2$ and $r\rho_-(k + m, r) > 2(1 + 5\sqrt{k/m})(\sqrt{k/s}/4 + 1)\lambda\sqrt{s}$. Then any $\epsilon$-optimal solution $\widetilde{\beta}$ must satisfy

$$\|\widetilde{\beta} - \beta^*\|_2 \le \frac{2(1 + 5\sqrt{k/m})}{\rho_-(k + m, r)}\Big(\|\lambda_S\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \sqrt{k}\epsilon\Big) \lesssim \lambda\sqrt{s}.$$

We now present the proofs for the two lemmas above, starting with the proof of Lemma E.1.
We now present the proofs for the two lemmas above by starting with 
proof of Lemma E.l. 

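Proof of Lemma E.1. For simplicity, we omit the arguments $\beta^*, r$ in $\pi(i, j; \beta^*, r)$, $\rho_+(i, r)$, and $\rho_-(i, r)$. Let $I = S \cup J$ and $L = I \cup J$. For any $\alpha \in \mathbb{R}$, let $w = v_I + \alpha u_J$. Without loss of generality, we assume that $\|u_J\|_2 = 1$, $\|v_I\|_2 = 1$ and $\beta \in B_2(r, \beta^*)$. Using the definitions of $\rho_-(i + j)$ and $w$, we have

$$\rho_-(i + j)\|w\|_2^2 \le \underbrace{v_I^{T}\nabla^2\mathcal{L}(\beta)v_I}_{c_1} + 2\alpha\underbrace{v_I^{T}\nabla^2\mathcal{L}(\beta)u_J}_{b} + \alpha^2\underbrace{u_J^{T}\nabla^2\mathcal{L}(\beta)u_J}_{c_2},$$

which simplifies to

$$\big(c_2 - \rho_-(i + j)\big)\alpha^2 + 2b\alpha + \big(c_1 - \rho_-(i + j)\big) \ge 0. \tag{E.1}$$

Since the left hand side of (E.1) is nonnegative for all $\alpha$, we must have

$$\big(c_2 - \rho_-(i + j)\big)\big(c_1 - \rho_-(i + j)\big) \ge b^2.$$

Multiplying by $4/c_1^2$ on both sides of the inequality above, we obtain

$$\frac{4b^2}{c_1^2} \le 4c_1^{-1}\big(1 - \rho_-(i + j)/c_1\big)\big(c_2 - \rho_-(i + j)\big) = 4c_1^{-1}\rho_-(i + j)\big(1 - \rho_-(i + j)/c_1\big) \times \frac{c_2 - \rho_-(i + j)}{\rho_-(i + j)} \le \frac{c_2 - \rho_-(i + j)}{\rho_-(i + j)} \le \frac{\rho_+(j)}{\rho_-(i + j)} - 1,$$

where, in the second-to-last inequality, we use $4c_1^{-1}\rho_-(i + j)(1 - \rho_-(i + j)/c_1) \le 1$; and the last inequality is due to $c_2 = u_J^{T}\nabla^2\mathcal{L}(\beta)u_J \le \rho_+(j)$. This yields

$$\frac{|v_I^{T}\nabla^2\mathcal{L}(\beta)u_J|/\|u_J\|_2}{v_I^{T}\nabla^2\mathcal{L}(\beta)v_I/\|v_I\|_2} = \frac{|v_I^{T}\nabla^2\mathcal{L}(\beta)u_J|}{v_I^{T}\nabla^2\mathcal{L}(\beta)v_I} \le \frac{1}{2}\sqrt{\frac{\rho_+(j)}{\rho_-(i + j)} - 1}.$$

The proof is completed by taking the supremum of the left hand side with respect to $\beta, u, v$. □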

















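Proof of Lemma E.2. For simplicity, we write $\nabla^2\mathcal{L}(\beta)$ as $\nabla^2\mathcal{L}$ whenever we have $\beta \in B_2(r, \beta^*) = \{\beta : \|\beta - \beta^*\|_2 \le r\}$. Since we do not know in advance whether $\widetilde{\beta}$ belongs to $B_2(r, \beta^*)$, we need to construct an intermediate estimator $\widetilde{\beta}^*$ such that $\|\widetilde{\beta}^* - \beta^*\|_2 \le r$. Let $\widetilde{\beta}^* = \beta^* + t(\widetilde{\beta} - \beta^*)$, where $t = 1$ if $\|\widetilde{\beta} - \beta^*\|_2 \le r$, and $t \in (0, 1)$ is such that $\|\widetilde{\beta}^* - \beta^*\|_2 = r$ otherwise. Using the Hölder inequality, we obtain

$$\|(\widetilde{\beta} - \beta^*)_{I^c}\|_2 \le \Big(\underbrace{\|(\widetilde{\beta} - \beta^*)_{I^c}\|_1}_{\mathrm{I}}\,\underbrace{\|(\widetilde{\beta} - \beta^*)_{I^c}\|_\infty}_{\mathrm{II}}\Big)^{1/2}. \tag{E.2}$$

We bound I and II respectively. For I, since $I^c \subseteq \mathcal{E}^c$, we apply Lemma F.1 and obtain

$$\|(\widetilde{\beta} - \beta^*)_{I^c}\|_1 \le \|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1 \le 5\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_1. \tag{E.3}$$

For II, note that $I = \mathcal{E} \cup J$. Using the definition of $J$, we obtain

$$\|(\widetilde{\beta} - \beta^*)_{I^c}\|_\infty \le \|(\widetilde{\beta} - \beta^*)_{\mathcal{E}^c}\|_1/m \le \frac{5}{m}\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_1. \tag{E.4}$$

Plugging (E.3) and (E.4) into (E.2) gives

$$\|(\widetilde{\beta} - \beta^*)_{I^c}\|_2 \le \frac{5}{\sqrt{m}}\|(\widetilde{\beta} - \beta^*)_{\mathcal{E}}\|_1 \le 5\sqrt{\frac{k}{m}}\|(\widetilde{\beta} - \beta^*)_I\|_2.$$

Using the triangle inequality along with the result above yields

$$\|\widetilde{\beta} - \beta^*\|_2 \le \|(\widetilde{\beta} - \beta^*)_I\|_2 + \|(\widetilde{\beta} - \beta^*)_{I^c}\|_2 \le (1 + 5\sqrt{k/m})\|(\widetilde{\beta} - \beta^*)_I\|_2.$$

Since $\widetilde{\beta}^* - \beta^* = t(\widetilde{\beta} - \beta^*)$, we have

$$\|\widetilde{\beta}^* - \beta^*\|_2 = t\|\widetilde{\beta} - \beta^*\|_2 \le (1 + 5\sqrt{k/m})\|(\widetilde{\beta}^* - \beta^*)_I\|_2. \tag{E.5}$$

Thus to bound $\|\widetilde{\beta}^* - \beta^*\|_2$, it suffices to bound $\|(\widetilde{\beta}^* - \beta^*)_I\|_2$.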

Bounding ||(3* -/3*)/|| 2 by D s c (f3,(3*): 

For notational simplicity, we write u = P* — f3* sometimes. Let u = 
(ui, U 2 , • ■ •, u p ) T . Without loss of generality, we assume the first k elements 
of P* contains the true support S. When j > k, Uj is arranged such that 
|ttfc+i| > \uk +2 \ ■ ■ ■ > \u p \. Let Jo = £ = (1,..., k}, and J, = {k + (i-l)m + 
1 ,..., k + im}, for * = 1 , 2 ,..., with the size of last block smaller or equal 
than m. In this manner, we have J\ = J and I = Jq U J\. Moreover, we 
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have llujJ^ < Hujjloov^ < || u Ji_i||i/\A^ when i > 1, which implies that 
1 II U ./J |2 < llufclli/v^. Now if 

(E.6) 1 — 2-7r(|I|,rw)m~ 1//2 ^| Ug j^ 1 > 0, 

11 u /11 2 

separating the support of u into I, I c and using u/ c V 2 £(/3)ujc >0, we obtain 
u t V 2 £(/3)u > uj V 2 £(/3 )u/ + 2 uJv 2 £(/3)uj. 

Z>1 

> u) r V 2 £(/3 )u/ (l - 2 tt(| J|, m) ^ 

where we use the definition of 7r(|I|,m) and p_(|/|) in the last two inequali¬ 
ties. Notice that ||(/3 — (3*)s\\i < \/fc||(/3 — /3*)^r ||2 and applying Lemma F.l, 
we obtain 

\\(P*-/3*) £ 4i=t\\(P-P*)£4i<5 X t\\(P-(3*) £ \\i<5Vk x ||(r-/3*) £ || 2 . 


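Further noting that $\mathcal{E} \subseteq I$, we obtain

$1 - 2\pi(k+m,m)\sqrt{\dfrac{1}{m}}\;\dfrac{\|(\tilde\beta^*-\beta^*)_{\mathcal{E}^c}\|_1}{\|(\tilde\beta^*-\beta^*)_{I}\|_2} \ge 1 - 10\,\pi(k+m,m) \times \sqrt{\dfrac{k}{m}}.$

Using $\pi(k+m,m) \le 2^{-1}\big(\rho_+(m)/\rho_-(k+2m) - 1\big)^{1/2}$ and Assumption 4.1 with $c = 100$ results in

$1 - 2\pi(k+m,m)\,m^{-1/2}\,\dfrac{\|(\tilde\beta^*-\beta^*)_{\mathcal{E}^c}\|_1}{\|(\tilde\beta^*-\beta^*)_{I}\|_2} \ge 1 - 5\sqrt{\dfrac{k}{m}}\Big(\dfrac{\rho_+(m)}{\rho_-(k+2m)} - 1\Big)^{1/2} \ge 1/2.$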


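Therefore, for any $\beta \in B_2(r,\beta^*)$, we have $(\tilde\beta^*-\beta^*)^T[\nabla^2\mathcal{L}(\beta)](\tilde\beta^*-\beta^*) \ge \tfrac{1}{2}\rho_-(k+m)\,\|(\tilde\beta^*-\beta^*)_{I}\|_2^2$. By the mean value theorem, there exists a $\gamma \in [0,1]$ such that

$\langle\nabla\mathcal{L}(\tilde\beta^*) - \nabla\mathcal{L}(\beta^*),\, \tilde\beta^*-\beta^*\rangle = (\tilde\beta^*-\beta^*)^T\big[\nabla^2\mathcal{L}(\gamma\tilde\beta^* + (1-\gamma)\beta^*)\big](\tilde\beta^*-\beta^*)$

$\qquad \ge 2^{-1}\rho_-(k+m)\,\|(\tilde\beta^*-\beta^*)_{I}\|_2^2.$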


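We further bound the left-hand side $D_{\mathcal{L}}^s(\tilde\beta^*,\beta^*)$ in the following.

Bounding $D_{\mathcal{L}}^s(\tilde\beta^*,\beta^*)$:

Define $u = \nabla\mathcal{L}(\tilde\beta) + \lambda\odot\xi$, where $\xi \in \partial\|\tilde\beta\|_1$. Then by Lemma F.2 and the definition of $D_{\mathcal{L}}^s(\beta,\beta^*)$, we obtain

(E.7)  $D_{\mathcal{L}}^s(\tilde\beta^*,\beta^*) \le t\,D_{\mathcal{L}}^s(\tilde\beta,\beta^*) = t\,\langle\nabla\mathcal{L}(\tilde\beta) - \nabla\mathcal{L}(\beta^*),\, \tilde\beta-\beta^*\rangle.$

Adding and subtracting the term $t\langle\lambda\odot\xi,\, \tilde\beta-\beta^*\rangle$, we have

$t\,\langle\nabla\mathcal{L}(\tilde\beta) - \nabla\mathcal{L}(\beta^*),\, \tilde\beta-\beta^*\rangle = t\,\langle\nabla\mathcal{L}(\tilde\beta) + \lambda\odot\xi,\, \tilde\beta-\beta^*\rangle - t\,\langle\nabla\mathcal{L}(\beta^*),\, \tilde\beta-\beta^*\rangle - t\,\langle\lambda\odot\xi,\, \tilde\beta-\beta^*\rangle.$

Using a similar argument as in the proof of Lemma B.7, we obtain

$D_{\mathcal{L}}^s(\tilde\beta^*,\beta^*) \le \big(\|\lambda_S\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \epsilon\sqrt{k}\big)\,\|(\tilde\beta^*-\beta^*)_{I}\|_2.$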

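Bounding $\|\tilde\beta^*-\beta^*\|_2$ and $\|\tilde\beta-\beta^*\|_2$:

Combining the upper and lower bounds for $D_{\mathcal{L}}^s(\tilde\beta^*,\beta^*)$, we have

$\|(\tilde\beta^*-\beta^*)_{I}\|_2 \le \dfrac{2}{\rho_-(k+m)}\big(\|\lambda_S\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \epsilon\sqrt{k}\big).$

Plugging the above bound into (E.5) yields

$\|\tilde\beta^*-\beta^*\|_2 \le \dfrac{2\big(1+5\sqrt{k/m}\big)}{\rho_-(k+m,r)}\big(\|\lambda_S\|_2 + \|\nabla\mathcal{L}(\beta^*)_{\mathcal{E}}\|_2 + \sqrt{k}\,\epsilon\big) < r.$

If $t \ne 1$, by the construction of $\tilde\beta^*$, we must have $\|\tilde\beta^*-\beta^*\|_2 = r$, which contradicts the above inequality. Thus $t$ must be 1, which implies $\tilde\beta^* = \tilde\beta$. This completes the proof. □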

E.2. Computational Theory. In this section, we prove the technical lemmas used in Appendix C.

E.2.1. Contraction Stage. We start with a lemma that characterizes the locality of the solution sequence. It also provides lower and upper bounds for $\phi_c$, which will be exploited in our final localized iteration complexity analysis.

Lemma E.3. Under Assumption 4.3 and the same conditions of Theorem 4.2, we have

$\|\beta^{(1,k)}-\beta^*\|_2 \le R/2$  and  $\phi_0 \le \phi_c \le \gamma_u\rho_c.$

The next two lemmas are critical for the analysis of computational complexity in the contraction stage.

Lemma E.4. Recall that $F(\beta,\lambda) = \mathcal{L}(\beta) + \|\lambda\odot\beta\|_1$. We have

$F(\beta;\lambda^{(0)}) - F(\beta^{(1,k)};\lambda^{(0)}) \ge 2^{-1}\phi_c\big\{\|\beta-\beta^{(1,k)}\|_2^2 - \|\beta-\beta^{(1,k-1)}\|_2^2\big\}.$

Our next lemma describes the relationship between the suboptimality measure $\omega_\lambda(\beta^{(1,k)})$ and $\|\beta^{(1,k)}-\beta^{(1,k-1)}\|_2$, which is critical to establish the iteration complexity of the contraction stage.

Lemma E.5. $\omega_\lambda(\beta^{(1,k)}) \le (\phi_c+\rho_c)\,\|\beta^{(1,k)}-\beta^{(1,k-1)}\|_2.$

Proof of Lemma E.3. We first prove the second statement. Suppose that, for all $k \ge 1$, it holds that

(E.8)  $\|\beta^{(k-1)}-\beta^*\|_2 \le R/2.$

Then for any $\beta$ such that $\|\beta-\beta^*\|_2 \le R/2$, we have $\|\beta^{(k-1)}-\beta\|_2 \le \|\beta^{(k-1)}-\beta^*\|_2 + \|\beta^*-\beta\|_2 \le R$ by the triangle inequality. Let $v = \beta-\beta^{(k-1)}$. Using Taylor expansion, we have

$\mathcal{L}(\beta) = \mathcal{L}(\beta^{(k-1)}) + \langle\nabla\mathcal{L}(\beta^{(k-1)}),\, v\rangle + \int_0^1 \langle\nabla\mathcal{L}(\beta^{(k-1)}+tv) - \nabla\mathcal{L}(\beta^{(k-1)}),\, v\rangle\,dt.$

Applying the Cauchy-Schwarz inequality and using Assumption 4.3, we obtain

$\mathcal{L}(\beta) \le \mathcal{L}(\beta^{(k-1)}) + \langle\nabla\mathcal{L}(\beta^{(k-1)}),\, v\rangle + \int_0^1 \rho_c\,t\,\|v\|_2^2\,dt$

$\qquad = \mathcal{L}(\beta^{(k-1)}) + \langle\nabla\mathcal{L}(\beta^{(k-1)}),\, \beta-\beta^{(k-1)}\rangle + \dfrac{\rho_c}{2}\,\|\beta-\beta^{(k-1)}\|_2^2.$

The iterative LAMM algorithm implies that $\phi_0 \le \phi_c \le (1+\gamma_u)\rho_c$. Otherwise, if $\phi_c > (1+\gamma_u)\rho_c > \rho_c$, then $\phi_c' = \phi_c/\gamma_u > \gamma_u^{-1}(1+\gamma_u)\rho_c > \rho_c$ is the quadratic parameter in the previous LAMM iteration. Let $\Psi'(\beta;\beta^{(1,k-1)})$ be the corresponding local quadratic approximation. Then for any $\beta \in B_2(R/2,\beta^*)$, it holds that

$\Psi'(\beta;\beta^{(1,k-1)}) = \mathcal{L}(\beta^{(1,k-1)}) + \langle\nabla\mathcal{L}(\beta^{(1,k-1)}),\, \beta-\beta^{(1,k-1)}\rangle + \dfrac{\phi_c'}{2}\,\|\beta-\beta^{(1,k-1)}\|_2^2 + \|\lambda^{(0)}\odot\beta\|_1 \ge F(\beta,\lambda^{(0)}).$

However, by the stopping rule of the I-LAMM algorithm, we must have $\Psi'(\beta;\beta^{(1,k-1)}) < F(\beta,\lambda^{(0)})$. This contradiction shows that $\phi_c \le (1+\gamma_u)\rho_c$, and $\phi_0 \le \phi_c$ can be ensured by taking $\phi_0$ small enough.

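Therefore, it remains to show that (E.8) holds, which we do by induction. For $k = 1$, it obviously holds. Now suppose that $\|\beta^{(k-1)}-\beta^*\|_2 \le R/2$. Taking $\beta = \hat\beta$ in Lemma E.4, we obtain

$0 \ge F(\hat\beta,\lambda) - F(\beta^{(j)},\lambda) \ge \dfrac{\phi_c}{2}\big\{\|\hat\beta-\beta^{(j)}\|_2^2 - \|\hat\beta-\beta^{(j-1)}\|_2^2\big\},$

which implies

(E.9)  $\|\beta^{(j)}-\hat\beta\|_2 \le \|\beta^{(j-1)}-\hat\beta\|_2.$

Taking $j = 1,\ldots,k$ and repeating (E.9) yields

$\|\beta^{(k)}-\hat\beta\|_2 \le \|\beta^{(k-1)}-\hat\beta\|_2 \le \cdots \le \|\beta^{(0)}-\hat\beta\|_2.$

Therefore, applying Lemma 5.1, we obtain

$\|\beta^{(k)}-\beta^*\|_2 \le \|\beta^{(0)}-\beta^*\|_2 + 2\,\|\hat\beta-\beta^*\|_2 \le \|\beta^{(0)}-\beta^*\|_2 + 18\rho_*^{-1}\lambda\sqrt{s} \le R/2.$

This completes the induction step and thus finishes the proof. □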
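The inflation argument used for the upper bound on $\phi_c$ can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes a separable quadratic loss with curvatures `curv` (so the gradient Lipschitz constant is `max(curv)`), and the names `lamm_line_search` and `demo` are hypothetical. The quadratic parameter is multiplied by $\gamma_u$ until the surrogate majorizes the loss at the candidate update, so the accepted value cannot exceed $\gamma_u$ times the Lipschitz constant unless the initial value already did.

```python
def soft_threshold(x, tau):
    """Componentwise soft-thresholding with per-coordinate thresholds tau."""
    out = []
    for xi, ti in zip(x, tau):
        mag = max(abs(xi) - ti, 0.0)
        out.append(mag if xi > 0 else -mag)
    return out


def lamm_line_search(loss, grad, b_prev, lam, phi0, gamma_u, tol=1e-12):
    """Inflate phi by gamma_u until the quadratic surrogate majorizes the loss.

    The penalty term ||lam * b_new||_1 appears on both sides of the
    majorization check and therefore cancels.
    """
    g = grad(b_prev)
    base = loss(b_prev)
    phi = phi0
    while True:
        b_new = soft_threshold([bj - gj / phi for bj, gj in zip(b_prev, g)],
                               [l / phi for l in lam])
        d = [u - v for u, v in zip(b_new, b_prev)]
        surrogate = (base + sum(gj * dj for gj, dj in zip(g, d))
                     + 0.5 * phi * sum(dj * dj for dj in d))
        if surrogate >= loss(b_new) - tol:
            return b_new, phi
        phi *= gamma_u


def demo():
    # Assumed toy problem: L(b) = 0.5 * sum_j curv_j * (b_j - center_j)^2.
    curv = [0.5, 2.0, 8.0]
    center = [1.0, -2.0, 0.3]
    loss = lambda b: 0.5 * sum(a * (bj - cj) ** 2
                               for a, bj, cj in zip(curv, b, center))
    grad = lambda b: [a * (bj - cj) for a, bj, cj in zip(curv, b, center)]
    b_new, phi = lamm_line_search(loss, grad, [0.0, 0.0, 0.0],
                                  [0.4, 0.4, 0.4], phi0=0.1, gamma_u=2.0)
    return b_new, phi, max(curv)
```

Calling `demo()` returns the accepted update, the accepted quadratic parameter and the Lipschitz constant of the toy loss; the accepted parameter lies between `phi0` and `gamma_u * max(curv)`.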

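We now give the proofs of Lemmas E.4 and E.5.

Proof of Lemma E.4. We omit the subscripts in $\Psi_{\lambda^{(0)},\phi_c}(\beta,\beta^{(\ell,k-1)})$ and the superscript $\ell$ in $(\ell,k)$, and denote $\beta^{(\ell,k)}$ by $\beta^{(k)}$, where $\ell = 1$. Lemma E.3 allows us to use the localized Lipschitz condition. First, we have

(E.10)  $F(\beta,\lambda^{(0)}) - F(\beta^{(k)},\lambda^{(0)}) \ge F(\beta,\lambda^{(0)}) - \Psi(\beta^{(k)},\beta^{(k-1)}).$

The convexity of both $\mathcal{L}(\beta)$ and $\|\lambda\odot\beta\|_1$ implies

(E.11)  $\mathcal{L}(\beta) \ge \mathcal{L}(\beta^{(k-1)}) + \langle\nabla\mathcal{L}(\beta^{(k-1)}),\, \beta-\beta^{(k-1)}\rangle,$

(E.12)  $\|\lambda^{(0)}\odot\beta\|_1 \ge \|\lambda^{(0)}\odot\beta^{(k)}\|_1 + \langle\lambda^{(0)}\odot\xi^{(k)},\, \beta-\beta^{(k)}\rangle.$

Adding (E.11) and (E.12) together, we obtain

(E.13)  $F(\beta,\lambda^{(0)}) \ge \mathcal{L}(\beta^{(k-1)}) + \langle\nabla\mathcal{L}(\beta^{(k-1)}),\, \beta-\beta^{(k-1)}\rangle + \|\lambda^{(0)}\odot\beta^{(k)}\|_1 + \langle\lambda^{(0)}\odot\xi^{(k)},\, \beta-\beta^{(k)}\rangle.$

On the other side, $\Psi(\beta^{(k)},\beta^{(k-1)})$ can be written as

(E.14)  $\mathcal{L}(\beta^{(k-1)}) + \langle\nabla\mathcal{L}(\beta^{(k-1)}),\, \beta^{(k)}-\beta^{(k-1)}\rangle + \dfrac{\phi_c}{2}\,\|\beta^{(k)}-\beta^{(k-1)}\|_2^2 + \|\lambda^{(0)}\odot\beta^{(k)}\|_1.$

Plugging (E.13) and (E.14) back into (E.10), we obtain

(E.15)  $F(\beta,\lambda^{(0)}) - F(\beta^{(k)},\lambda^{(0)}) \ge -\dfrac{\phi_c}{2}\,\|\beta^{(k)}-\beta^{(k-1)}\|_2^2 + \langle\nabla\mathcal{L}(\beta^{(k-1)}) + \lambda^{(0)}\odot\xi^{(k)},\, \beta-\beta^{(k)}\rangle.$

By the first-order optimality condition, there exists some $\xi^{(k)} \in \partial\|\beta^{(k)}\|_1$ such that

$\nabla\mathcal{L}(\beta^{(k-1)}) + \phi_c(\beta^{(k)}-\beta^{(k-1)}) + \lambda^{(0)}\odot\xi^{(k)} = 0.$

Plugging the equality above into (E.15), we complete the proof. □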
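The inequality that results from combining (E.15) with the optimality condition, namely $F(\beta) - F(\beta^{(k)}) \ge \tfrac{\phi_c}{2}\{\|\beta-\beta^{(k)}\|_2^2 - \|\beta-\beta^{(k-1)}\|_2^2\}$, can be checked numerically on a small example. The sketch below is illustrative only: it assumes a least-squares loss, a fixed weight vector `lam`, and a quadratic parameter set to the trace bound on the largest eigenvalue of $X^TX/n$ so that majorization is guaranteed; all function names are hypothetical.

```python
import random


def soft_threshold(x, tau):
    """Componentwise soft-thresholding with per-coordinate thresholds tau."""
    out = []
    for xi, ti in zip(x, tau):
        mag = max(abs(xi) - ti, 0.0)
        out.append(mag if xi > 0 else -mag)
    return out


def predict(X, b):
    return [sum(xj * bj for xj, bj in zip(row, b)) for row in X]


def loss(X, y, b):
    """Least-squares loss (1 / 2n) * ||y - X b||_2^2."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, predict(X, b))) / (2.0 * len(y))


def grad(X, y, b):
    res = [fi - yi for yi, fi in zip(y, predict(X, b))]
    return [sum(row[j] * ri for row, ri in zip(X, res)) / len(y)
            for j in range(len(b))]


def objective(X, y, b, lam):
    return loss(X, y, b) + sum(l * abs(bj) for l, bj in zip(lam, b))


def lamm_step(X, y, b_prev, lam, phi):
    """One proximal step: soft-threshold the gradient step taken from b_prev."""
    g = grad(X, y, b_prev)
    return soft_threshold([bj - gj / phi for bj, gj in zip(b_prev, g)],
                          [l / phi for l in lam])


def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def min_three_point_gap(seed=0, n=20, p=6, trials=50):
    """Smallest value of LHS - RHS of the inequality over random points beta."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    lam = [0.1] * p
    # trace(X^T X) / n upper-bounds the largest eigenvalue of X^T X / n
    phi = sum(v * v for row in X for v in row) / n
    b_prev = [rng.gauss(0, 1) for _ in range(p)]
    b_new = lamm_step(X, y, b_prev, lam, phi)
    worst = float("inf")
    for _ in range(trials):
        b = [rng.gauss(0, 2) for _ in range(p)]
        lhs = objective(X, y, b, lam) - objective(X, y, b_new, lam)
        rhs = 0.5 * phi * (sq_dist(b, b_new) - sq_dist(b, b_prev))
        worst = min(worst, lhs - rhs)
    return worst
```

A nonnegative return value (up to floating-point rounding) from `min_three_point_gap` is consistent with the inequality holding at every sampled point.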

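Lemma E.5 bounds the suboptimality measure $\omega_\lambda(\beta^{(1,k)})$ by $\|\beta^{(1,k)}-\beta^{(1,k-1)}\|_2$, which is critical to establish the iteration complexity of the contraction stage.

Proof of Lemma E.5. We omit the superscript 1 in $\beta^{(1,k)}$ for simplicity. Since $\beta^{(k)}$ is the exact solution to the $k$th iteration at $\ell = 1$, the first-order optimality condition holds: there exists a $\xi^{(k)} \in \partial\|\beta^{(k)}\|_1$ such that

$\nabla\mathcal{L}(\beta^{(k-1)}) + \phi_c(\beta^{(k)}-\beta^{(k-1)}) + \lambda\odot\xi^{(k)} = 0.$

Then for any $u$ such that $\|u\|_1 = 1$, we have

$\langle\nabla\mathcal{L}(\beta^{(k)}) + \lambda\odot\xi^{(k)},\, u\rangle = \langle\nabla\mathcal{L}(\beta^{(k)}),\, u\rangle - \langle\nabla\mathcal{L}(\beta^{(k-1)}) + \phi_c(\beta^{(k)}-\beta^{(k-1)}),\, u\rangle$

$\qquad = \langle\nabla\mathcal{L}(\beta^{(k)}) - \nabla\mathcal{L}(\beta^{(k-1)}),\, u\rangle - \langle\phi_c(\beta^{(k)}-\beta^{(k-1)}),\, u\rangle$

$\qquad \le \|\nabla\mathcal{L}(\beta^{(k)}) - \nabla\mathcal{L}(\beta^{(k-1)})\|_\infty + \phi_c\,\|\beta^{(k)}-\beta^{(k-1)}\|_\infty$

$\qquad \le (\phi_c+\rho_c)\,\|\beta^{(k)}-\beta^{(k-1)}\|_2,$

where the last inequality is due to the localized Lipschitz continuity, since $\|\beta^{(k)}-\beta^*\|_2 \le R/2$ for all $k \ge 1$ by Lemma E.3 in the supplement. The proof is completed by taking the supremum over $\|u\|_1 \le 1$ in the inequality above. □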
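The bound of Lemma E.5 can be checked on a toy problem. The sketch below rests on stated assumptions: the suboptimality measure is taken to be $\omega_\lambda(\beta) = \min_{\xi\in\partial\|\beta\|_1}\|\nabla\mathcal{L}(\beta)+\lambda\odot\xi\|_\infty$ (as written out in the proof of Lemma E.14), evaluated coordinatewise in closed form; the loss is an assumed separable quadratic with Lipschitz constant `max(curv)`; and the names `omega` and `check_lemma_e5` are hypothetical.

```python
import math


def soft_threshold(x, tau):
    """Componentwise soft-thresholding with per-coordinate thresholds tau."""
    out = []
    for xi, ti in zip(x, tau):
        mag = max(abs(xi) - ti, 0.0)
        out.append(mag if xi > 0 else -mag)
    return out


def omega(g, b, lam):
    """min over subgradients xi of ||g + lam * xi||_inf, coordinate by coordinate."""
    worst = 0.0
    for gj, bj, lj in zip(g, b, lam):
        if bj > 0:        # xi_j = +1 is forced
            v = abs(gj + lj)
        elif bj < 0:      # xi_j = -1 is forced
            v = abs(gj - lj)
        else:             # xi_j may be any value in [-1, 1]
            v = max(abs(gj) - lj, 0.0)
        worst = max(worst, v)
    return worst


def check_lemma_e5(curv, center, b_prev, lam):
    """Return (omega at the new iterate, (phi + rho) * ||b_new - b_prev||_2)."""
    grad = lambda b: [a * (bj - cj) for a, bj, cj in zip(curv, b, center)]
    rho = max(curv)   # gradient Lipschitz constant of the separable quadratic
    phi = rho         # any phi >= rho gives a valid majorizing surrogate
    g_prev = grad(b_prev)
    b_new = soft_threshold([bj - gj / phi for bj, gj in zip(b_prev, g_prev)],
                           [l / phi for l in lam])
    step = math.sqrt(sum((u - v) ** 2 for u, v in zip(b_new, b_prev)))
    return omega(grad(b_new), b_new, lam), (phi + rho) * step
```

For instance, `check_lemma_e5([0.5, 2.0, 8.0], [1.0, -2.0, 0.3], [0.0, 0.0, 0.0], [0.4, 0.4, 0.4])` returns a pair whose first entry does not exceed the second.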

We then prove a technical lemma that is critical for the proof of Lemma 
5.4. 




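Lemma E.6 (Basic Inequality). Suppose the same conditions of Theorem 4.2 hold. Let $C = 225/(2\rho_*)$ and let $\tilde\beta^{(1)}$ be the $\epsilon$-optimal solution. Then we have the following basic inequality

$\langle\nabla\mathcal{L}(\tilde\beta^{(1)}) - \nabla\mathcal{L}(\beta^*),\, \tilde\beta^{(1)}-\beta^*\rangle \le C\lambda^2 s.$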

Proof. We omit the superscript in $\tilde\beta^{(1)}$ and write $\tilde\beta^{(1)}$ as $\beta$ for simplicity. Proposition 4.1 implies that

(E.16) ||/3 - (3*\\l < -(HA^IIa + ||V£(/3 *) s || 2 + eV\S\) < 15A y/i/p*. 

P* 

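On the other side, applying Lemma F.1 yields that

$\|\beta-\beta^*\|_1 \le \|(\beta-\beta^*)_{\mathcal{E}}\|_1 + \|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le 6\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1,$

where $\mathcal{E}$ can be taken as $S$. This, combined with (E.16), results in

(E.17)  $\|(\beta-\beta^*)_S\|_1 \le \sqrt{s}\,\|\beta-\beta^*\|_2 \le 15\lambda s/\rho_*.$

Therefore, we obtain $\|\beta-\beta^*\|_1 \le 6\,\|(\beta-\beta^*)_S\|_1 \le 90\lambda s/\rho_*$. Because $\beta$ is an $\epsilon$-optimal solution, we have

$\langle\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle \le \|\nabla\mathcal{L}(\beta) + \lambda\odot\xi - \lambda\odot\xi - \nabla\mathcal{L}(\beta^*)\|_\infty\,\|\beta-\beta^*\|_1$

$\qquad \le (1 + 1/4)\,\lambda\,\|\beta^*-\beta\|_1 \le 225\lambda^2 s/(2\rho_*).$

Therefore, the proof is completed. □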

E.2.2. Tightening Stage. We collect technical lemmas that are needed to prove Lemma 5.5 and Proposition 4.6 in this section. We start by giving a lemma that ensures the sparsity along the approximate solution sequence $\{\beta^{(\ell,k)}\}_{k=0}^{\infty}$ for the tightening stage ($\ell \ge 2$). We first need several technical lemmas. We remind the reader that the isotropic quadratic parameter in the tightening stage is denoted by $\phi_t$.

Lemma E.7. Suppose the same conditions in Theorem 4.2 hold. Assume $\beta^{(k+1)}, \beta^{(k)} \in B_2(r,\beta^*)$ are such that $\max\{\|\beta^{(k+1)}\|_0, \|\beta^{(k)}\|_0\} \le \tilde s$. For the LAMM algorithm, we have

$\rho_-(2s+2\tilde s, r) \le \phi_t \le \gamma_u\,\rho_+(2s+2\tilde s, r).$


Proof. The proof follows a similar argument as that of Lemma E.3 and thus is omitted here for simplicity. □

The next two lemmas connect the suboptimality measure $\omega_\lambda(\beta^{(\ell,k)})$ to the $\ell_2$ parameter bound and the objective function. They are similar to the ones proved in the contraction stage but with different constants, and we omit the proofs here.

Lemma E.8. If $\beta^{(\ell,k-1)}, \beta^{(\ell,k)} \in B_2(r/2,\beta^*)$, $\|(\beta^{(\ell,k)})_{S^c}\|_0 \le \tilde s$ and $\|(\beta^{(\ell,k-1)})_{S^c}\|_0 \le \tilde s$, then for any $\ell \ge 2$ and $k \ge 1$, we have

$\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,k)}) \le (1+\gamma_u)\,\rho_+(2s+2\tilde s, r)\,\|\beta^{(\ell,k)}-\beta^{(\ell,k-1)}\|_2.$

Lemma E.9. We have

$F(\beta^{(\ell,k)},\lambda^{(\ell-1)}) - F(\beta^{(\ell,k-1)},\lambda^{(\ell-1)}) \le -\dfrac{\phi_t}{2}\,\|\beta^{(\ell,k)}-\beta^{(\ell,k-1)}\|_2^2.$

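Next we give a lemma that characterizes the parameter estimation and objective function bound for sparse approximate solutions.

Lemma E.10. Assume Assumption 4.1 holds. Let $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \lambda/2$, $S \subseteq \mathcal{E}$ and $|\mathcal{E}| \le 2s$. If $\|(\beta-\beta^*)_{S^c}\|_0 \le \tilde s$, $\omega_\lambda(\beta) \le \epsilon$ and $\beta \in B_2(r,\beta^*)$, then we must have

$\|\beta-\beta^*\|_2 \le 3\rho_*^{-1}\lambda\sqrt{s}/2,$

$F(\beta,\lambda) - F(\beta^*,\lambda) \le 15\,\epsilon\,\rho_*^{-1}\lambda s.$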

Proof of Lemma E.10. For simplicity, we omit the arguments $k, r$ in $\rho_-(k,r)$ and $\rho_+(k,r)$ when $k$ and $r$ are clear from the context. Since the sparse localized condition implies the localized sparse strong convexity, the following inequality follows from Proposition B.3:

$\langle\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle \ge \rho_-\,\|\beta-\beta^*\|_2^2.$

Following a similar argument as in the proof of Lemma B.7, we have

(E.18)  $\|\beta-\beta^*\|_2 \le \dfrac{3\lambda\sqrt{s}}{2\rho_*}.$

Next, we prove the desired bound for $F(\beta,\lambda) - F(\beta^*,\lambda)$. Using the convexity of $F(\cdot,\lambda)$, we obtain

$F(\beta^*,\lambda) \ge F(\beta,\lambda) + \langle\nabla\mathcal{L}(\beta) + \lambda\odot\xi,\, \beta^*-\beta\rangle,$

which yields that

(E.19)  $F(\beta,\lambda) - F(\beta^*,\lambda) \le -\langle\nabla\mathcal{L}(\beta) + \lambda\odot\xi,\, \beta^*-\beta\rangle \le \epsilon\,\|\beta^*-\beta\|_1.$

On the other hand, we know from Lemma F.1 that the approximate solution $\beta$ falls in the $\ell_1$ cone:

$\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le 5\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1,$

which, together with (E.18), implies

(E.20)  $\|\beta-\beta^*\|_1 \le 6\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1 \le 6\sqrt{2s}\,\|(\beta-\beta^*)_{\mathcal{E}}\|_2 \le 15\rho_*^{-1}\lambda s.$

Plugging (E.20) into (E.19) completes the proof. □

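Lemma E.11 (Basic Inequality II). Assume Assumption 4.1 holds. Take $\mathcal{E}$ such that $S \subseteq \mathcal{E}$ and $|\mathcal{E}| \le 2s$. Let $\lambda \ge 4(\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon)$ and $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \lambda/2$. If $\|\beta_{\mathcal{E}^c}\|_0 \le \tilde s$, $\beta \in B_2(r,\beta^*)$ and $F(\beta,\lambda) - F(\beta^*,\lambda) \le C\lambda^2 s$, then

$\dfrac{\rho_-(s+\tilde s, r)}{2}\,\|\beta-\beta^*\|_2^2 + \dfrac{\lambda}{4}\,\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le \dfrac{5\lambda}{4}\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1 + C\lambda^2 s.$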

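Proof. Since $\|\beta_{S^c}\|_0 \le \tilde s$ and $\|\beta^*_{S^c}\|_0 = 0$, we have $\|(\beta-\beta^*)_{S^c}\|_0 \le \tilde s$. Proposition B.3 implies the localized sparse strong convexity:

(E.21)  $\mathcal{L}(\beta^*) + \langle\nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle + \dfrac{\rho_-}{2}\,\|\beta-\beta^*\|_2^2 \le \mathcal{L}(\beta).$

Recall that $F(\beta) = \mathcal{L}(\beta) + \|\lambda\odot\beta\|_1$. We have $F(\beta) - F(\beta^*) \le C\lambda^2 s$, or equivalently,

(E.22)  $\mathcal{L}(\beta) - \mathcal{L}(\beta^*) + \big(\|\lambda\odot\beta\|_1 - \|\lambda\odot\beta^*\|_1\big) \le C\lambda^2 s.$

Plugging (E.21) into the left-hand side of (E.22), we immediately obtain

$\dfrac{\rho_-}{2}\,\|\beta-\beta^*\|_2^2 \le C\lambda^2 s\; \underbrace{-\,\langle\nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle}_{\mathrm{I}} + \underbrace{\big(\|\lambda\odot\beta^*\|_1 - \|\lambda\odot\beta\|_1\big)}_{\mathrm{II}}.$

Following a similar argument as in the proof of Lemma B.7 in the appendix, we have

$\mathrm{I} \le \|(\beta-\beta^*)_{\mathcal{E}^c}\|_1\,\|\nabla\mathcal{L}(\beta^*)\|_\infty + \|(\beta-\beta^*)_{\mathcal{E}}\|_1\,\|\nabla\mathcal{L}(\beta^*)\|_\infty,$

$\mathrm{II} \le \lambda\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1 - \dfrac{\lambda}{2}\,\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1.$

Therefore, we have

$\dfrac{\rho_-}{2}\,\|\beta-\beta^*\|_2^2 + \big(\lambda/2 - \|\nabla\mathcal{L}(\beta^*)\|_\infty\big)\,\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le \big(\lambda + \|\nabla\mathcal{L}(\beta^*)\|_\infty\big)\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1 + C\lambda^2 s.$

The proof is finished by noticing that $\|\nabla\mathcal{L}(\beta^*)\|_\infty \le \lambda/4$. □

Lemma E.12. Assume Assumption 4.1 holds. Take $\mathcal{E}$ such that $S \subseteq \mathcal{E}$ and $|\mathcal{E}| \le 2s$. Let $\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon \le \lambda/4$ and $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \lambda/2$. If $\beta \in B_2(r,\beta^*)$ satisfies $\|\beta_{\mathcal{E}^c}\|_0 \le \tilde s$ and $F(\beta,\lambda) - F(\beta^*,\lambda) \le C\lambda^2 s$, then we must have

$\|\beta-\beta^*\|_2 \le C'\lambda\sqrt{s},$

$\langle\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle \le C'^2\rho_+(2s+\tilde s, r)\,\lambda^2 s,$

where $C' = \max\big\{2\sqrt{C/\rho_-(2s+\tilde s, r)},\; 5\sqrt{2}/\rho_-(2s+\tilde s, r)\big\}$.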

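Proof. We omit the arguments in $\rho_-(k,r)$ and $\rho_+(k,r)$ when they are clear from the context. Directly applying Lemma E.11, it follows that

$\dfrac{\rho_-}{2}\,\|\beta-\beta^*\|_2^2 \le \dfrac{5\lambda}{4}\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1 + C\lambda^2 s.$

To further bound the right-hand side of the inequality above, we discuss two cases regarding the magnitude of $\|(\beta-\beta^*)_{\mathcal{E}}\|_1$ with respect to $\lambda s$:

• If $5\lambda\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1/4 \le C\lambda^2 s$, we have

(E.23)  $\dfrac{\rho_-}{2}\,\|\beta-\beta^*\|_2^2 \le 2C\lambda^2 s$, and thus $\|\beta-\beta^*\|_2 \le 2\sqrt{\dfrac{C}{\rho_-}}\,\lambda\sqrt{s}.$

• If $5\lambda\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1/4 > C\lambda^2 s$, we have

$\dfrac{\rho_-}{2}\,\|\beta-\beta^*\|_2^2 \le 5\lambda\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1/2 \le 5\lambda\sqrt{2s}\,\|\beta-\beta^*\|_2/2,$

which further yields

(E.24)  $\|\beta-\beta^*\|_2 \le \dfrac{5\sqrt{2}}{\rho_-}\,\lambda\sqrt{s}.$

Combining (E.23) and (E.24), we obtain

$\|\beta-\beta^*\|_2 \le \max\Big\{2\sqrt{\dfrac{C}{\rho_-}},\; \dfrac{5\sqrt{2}}{\rho_-}\Big\}\,\lambda\sqrt{s} = C'\lambda\sqrt{s},$

where $C' = \max\{2\sqrt{C/\rho_-},\; 5\sqrt{2}/\rho_-\}$. Using Proposition B.3, we obtain

$D_{\mathcal{L}}^s(\beta,\beta^*) = \langle\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle \le \rho_+\,\|\beta-\beta^*\|_2^2 \le C'^2\rho_+\lambda^2 s.$

This completes the proof. □

Lemma E.13. Assume Assumption 4.1 holds. Take $\mathcal{E}$ such that $S \subseteq \mathcal{E}$ with $|\mathcal{E}| \le 2s$. Let $\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon \le \lambda/4$ and $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \lambda/2$. Let $\beta \in B_2(r,\beta^*)$ satisfy $\|\beta_{\mathcal{E}^c}\|_0 \le \tilde s$ and $F(\beta,\lambda) - F(\beta^*,\lambda) \le C\lambda^2 s$. Let $C_0 = 80\gamma_u\,(\rho^*/\rho_*)\max\{\sqrt{2C\rho_*},\, 5\} + 64\,(\rho^*/\rho_*)^2\max\{4C\rho_*,\, 50\}$. If $\tilde s \ge C_0 s$, then the one-step LAMM algorithm produces a $(2s+\tilde s)$-sparse solution:

$\|(T_{\lambda,\phi_t}(\beta))_{\mathcal{E}^c}\|_0 \le \tilde s.$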

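Proof. For simplicity, we write $\bar\beta = \beta - \phi_t^{-1}\nabla\mathcal{L}(\beta)$. To show that $\|(S(\bar\beta, \phi_t^{-1}\lambda))_{\mathcal{E}^c}\|_0 \le \tilde s$, it suffices to prove that the total number of indices $j \in \mathcal{E}^c$ such that $|\bar\beta_j| > \lambda_j/\phi_t$ is no more than $\tilde s$. We first write $\bar\beta$ as

$\bar\beta = \beta - \dfrac{1}{\phi_t}\nabla\mathcal{L}(\beta) = \beta - \dfrac{1}{\phi_t}\nabla\mathcal{L}(\beta^*) - \dfrac{1}{\phi_t}\big(\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*)\big).$

Define $S_n = \{j \in \mathcal{E}^c : |(\beta - \phi_t^{-1}\nabla\mathcal{L}(\beta))_j| \ge \lambda_j/\phi_t\}$, and notice that $\{j \in \mathcal{E}^c : (T_{\lambda,\phi_t}(\beta))_j \ne 0\} \subseteq S_n$; thus it suffices to show $|S_n| \le \tilde s$. We further define $S_n^1$, $S_n^2$ and $S_n^3$ as:

(E.25)  $S_n^1 = \Big\{j \in \mathcal{E}^c : |\beta_j| \ge \dfrac{1}{4}\,\dfrac{\lambda_j}{\phi_t}\Big\},$

(E.26)  $S_n^2 = \Big\{j \in \mathcal{E}^c : \Big|\dfrac{\nabla\mathcal{L}(\beta^*)_j}{\phi_t}\Big| > \dfrac{1}{2}\,\dfrac{\lambda_j}{\phi_t}\Big\},$

(E.27)  $S_n^3 = \Big\{j \in \mathcal{E}^c : \Big|\dfrac{(\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*))_j}{\phi_t}\Big| \ge \dfrac{1}{4}\,\dfrac{\lambda_j}{\phi_t}\Big\}.$

We immediately have $S_n \subseteq S_n^1 \cup S_n^2 \cup S_n^3$. It suffices to prove that $|S_n^1| + |S_n^2| + |S_n^3| \le \tilde s$. The assumption that $\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon \le \lambda/4$ implies $S_n^2 = \emptyset$ and thus $|S_n^2| = 0$. In what follows, we bound $|S_n^1|$ and $|S_n^3|$, respectively.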

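Bound for $|S_n^1|$:

For all $j \in \mathcal{E}^c$, we have $\beta^*_j = 0$. Using the Markov inequality, we obtain

$|S_n^1| = \Big|\Big\{j \in \mathcal{E}^c : |\beta_j| \ge \dfrac{1}{4}\,\dfrac{\lambda_j}{\phi_t}\Big\}\Big| \le \sum_{j\in\mathcal{E}^c} \dfrac{4\phi_t}{\lambda_j}\,|\beta_j|.$

Because $\|\lambda_{\mathcal{E}^c}\|_{\min} \ge \lambda/2$, we have

$|S_n^1| \le \dfrac{8\phi_t}{\lambda}\sum_{j\in\mathcal{E}^c}|\beta_j - \beta^*_j| = \dfrac{8\phi_t}{\lambda}\,\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1.$

It remains to bound $\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1$. A similar argument as in Lemma E.11 implies

$\dfrac{\lambda}{4}\,\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le \dfrac{5\lambda}{4}\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1 + C\lambda^2 s.$

Therefore, $\beta-\beta^*$ falls in the approximate $\ell_1$ cone:

$\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le 5\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1 + 4C\lambda s \le 5\sqrt{2}\,C'\lambda s + 4C\lambda s,$

where $C' = \max\{2\sqrt{C/\rho_-},\; 5\sqrt{2}/\rho_-\}$ and the last inequality is due to Lemma E.12. Let $C'' = \max\{10\sqrt{2C/\rho_-},\; 50/\rho_-\} + 4C$; then we have

$|S_n^1| \le 8\phi_t C'' s \le 8C''\gamma_u\rho_+ s,$

where we use the fact $\phi_t \le \gamma_u\rho_+$ in the last inequality.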

Bound for $|S_n^3|$:

Consider an arbitrary subset $S' \subseteq S_n^3$ with size $s' = |S'| \le \tilde s$. Let us further consider a $d$-dimensional sign vector $u$ such that $\|u\|_\infty = 1$ and $\|u\|_0 = s'$. There exists some $u$ such that

$\sum_{j\in\mathcal{E}^c} \dfrac{\lambda_j}{4}\,|u_j| \le u^T\{\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*)\}.$

By the mean value theorem, there exists some $\gamma \in [0,1]$ such that $\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*) = [\nabla^2\mathcal{L}(\gamma\beta + (1-\gamma)\beta^*)](\beta-\beta^*)$. Let $H = \nabla^2\mathcal{L}(\gamma\beta + (1-\gamma)\beta^*)$. Writing $u^T(\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*))$ as $\langle H^{1/2}u,\, H^{1/2}(\beta-\beta^*)\rangle$ and applying the Hölder inequality, we obtain

(E.28)  $\lambda s'/8 \le \|H^{1/2}u\|_2\,\|H^{1/2}(\beta-\beta^*)\|_2 \le \sqrt{\rho_+(s',r)\,s'}\;\underbrace{\|H^{1/2}(\beta-\beta^*)\|_2}_{\mathrm{I}}.$

To bound term I, we apply Lemma E.12 and obtain that

$\mathrm{I} = \|H^{1/2}(\beta-\beta^*)\|_2 \le C'\sqrt{\rho_+(2s+\tilde s, r)}\,\lambda\sqrt{s},$

where $C' = \max\{2\sqrt{C/\rho_-},\; 5\sqrt{2}/\rho_-\}$. Plugging the above inequality into (E.28), we obtain

$\lambda s'/8 \le \sqrt{\rho_+(s',r)}\,\sqrt{s'} \times C'\sqrt{\rho_+(2s+\tilde s, r)}\,\lambda\sqrt{s}.$

Taking squares of both sides yields

$s' \le 64\,\rho_+(s',r)\,C'^2\rho_+(2s+\tilde s, r)\,s \le 64\,\rho_+(\tilde s,r)\,C'^2\rho_+(2s+\tilde s, r)\,s < \tilde s,$

where the last inequality is due to the assumption. Since $s' = |S'|$ achieves the maximum possible value such that $s' \le \tilde s$ for any subset $S'$ of $S_n^3$, and the above inequality shows that $s' < \tilde s$, we must have

$S' = \{j : |(\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*))_j| \ge \lambda_j/4\}.$

Finally, combining the bounds for $|S_n^1|$, $|S_n^2|$ and $|S_n^3|$, we obtain

$\|(T_{\lambda,\phi_t}(\beta))_{\mathcal{E}^c}\|_0 \le 8C''\gamma_u\rho_+ s + 64\,\rho_+(\tilde s,r)\,C'^2\rho_+(2s+\tilde s, r)\,s \le \tilde s.$

□
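The reduction at the start of the proof of Lemma E.13 (the nonzero coordinates of the one-step update are exactly those where the gradient-step vector exceeds the threshold $\lambda_j/\phi_t$ in absolute value) can be seen in a few lines. This is a minimal sketch under the assumption that the one-step update is the soft-thresholding of $\beta - \phi_t^{-1}\nabla\mathcal{L}(\beta)$ at levels $\lambda_j/\phi_t$, as in the proof; the gradient vector is supplied directly as made-up numbers, and the function names are hypothetical.

```python
def soft_threshold(x, tau):
    """Componentwise soft-thresholding with per-coordinate thresholds tau."""
    out = []
    for xi, ti in zip(x, tau):
        mag = max(abs(xi) - ti, 0.0)
        out.append(mag if xi > 0 else -mag)
    return out


def one_step_update(beta, grad, lam, phi):
    """Return (thresholded update, gradient-step vector beta - grad / phi)."""
    bar = [b - g / phi for b, g in zip(beta, grad)]
    return soft_threshold(bar, [l / phi for l in lam]), bar


def support(v):
    """Indices of the nonzero coordinates of v."""
    return {j for j, vj in enumerate(v) if vj != 0.0}


def exceed_set(bar, lam, phi):
    """Indices j with |bar_j| > lam_j / phi."""
    return {j for j, (bj, lj) in enumerate(zip(bar, lam)) if abs(bj) > lj / phi}


# Illustrative numbers only (not taken from the paper).
beta = [1.0, 0.0, -0.5, 0.0]
grad = [0.2, 0.3, -0.1, -2.0]
lam = [0.5, 0.5, 0.5, 0.5]
phi = 2.0
updated, bar = one_step_update(beta, grad, lam, phi)
```

Counting the indices in `exceed_set(bar, lam, phi)` is therefore the same as counting the nonzeros of `updated`, which is the quantity the three sets in (E.25)-(E.27) are designed to control.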

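Lemma E.14. Assume the same conditions in Theorem 4.7 hold. The solution sequence always satisfies

$F(\beta^{(\ell,k)},\lambda^{(\ell-1)}) - F(\beta^*,\lambda^{(\ell-1)}) \le C\lambda^2 s,$

$\|(\beta^{(\ell,k)})_{\mathcal{E}_\ell^c}\|_0 \le \tilde s$, and $\|\beta^{(\ell,k)}-\beta^*\|_2 \le C'\lambda\sqrt{s},$

for $\ell \ge 2$, $k \ge 0$, where $C = 15/(4\rho_*)$ and $C' = 5\sqrt{2}/\rho_*$.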

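Proof. We omit the argument $\lambda$ in $F(\beta,\lambda)$ for notational simplicity. We prove the lemma by mathematical induction on $(\ell,k)$.

Base case: For the $\ell$th tightening step, the stopping criterion implies that $\omega_{\lambda^{(\ell-2)}}(\beta^{(\ell,0)}) \le \epsilon_t$. On the other hand, the suboptimality condition for the first iteration in the $\ell$th step can be written as

$\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,0)}) = \min_{\xi\in\partial\|\beta^{(\ell,0)}\|_1}\big\{\|\nabla\mathcal{L}(\beta^{(\ell,0)}) + \lambda^{(\ell-1)}\odot\xi\|_\infty\big\},$

which, together with the triangle inequality, yields

$\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,0)}) \le \min_{\xi}\big\{\|\nabla\mathcal{L}(\beta^{(\ell,0)}) + \lambda^{(\ell-2)}\odot\xi\|_\infty + \underbrace{\|(\lambda^{(\ell-1)} - \lambda^{(\ell-2)})\odot\xi\|_\infty}_{\mathrm{I}}\big\}.$

For the second term I on the right-hand side, we have

$\|(\lambda^{(\ell-1)} - \lambda^{(\ell-2)})\odot\xi\|_\infty \le \|\lambda^{(\ell-1)} - \lambda^{(\ell-2)}\|_\infty \le \lambda/8.$

Using the fact that $\epsilon \le \lambda/8$, we obtain

$\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,0)}) \le \min_{\xi}\big\{\|\nabla\mathcal{L}(\beta^{(\ell,0)}) + \lambda^{(\ell-2)}\odot\xi\|_\infty + \lambda/8\big\} \le \lambda/4.$

Thus the initialization satisfies

$\|(\beta^{(\ell,0)})_{\mathcal{E}_\ell^c}\|_0 \le \tilde s$, $\omega_{\lambda^{(\ell-1)}}(\beta^{(\ell,0)}) \le \lambda/4$, and $\phi_t \le \gamma_u\rho_+(2s+2\tilde s, r).$

Therefore, using Lemma E.10, we obtain

$F(\beta^{(\ell,0)}) - F(\beta^*) \le C\lambda^2 s$, where $C = 15/(4\rho_*)$.

Therefore, directly applying Lemma E.12 results in

$\|\beta^{(\ell,0)}-\beta^*\|_2 \le C'\lambda\sqrt{s},$

where $C' = 5\sqrt{2}/\rho_*$.

Induction step: Suppose that, at the $(k-1)$-th iteration of the LAMM method in the $\ell$-th step, we have

$\|(\beta^{(\ell,k-1)})_{\mathcal{E}_\ell^c}\|_0 \le \tilde s$, $\phi_t \le \gamma_u\rho_+$, and $F(\beta^{(\ell,k-1)}) - F(\beta^*) \le C\lambda^2 s.$

Then according to Lemma E.13, the solution to the LAMM method at the $k$th iteration is $(2s+\tilde s)$-sparse: $\beta^{(\ell,k)} = T_{\lambda^{(\ell-1)},\phi_t}(\beta^{(\ell,k-1)})$ satisfies $\|(\beta^{(\ell,k)})_{\mathcal{E}_\ell^c}\|_0 \le \tilde s$. Thus Lemma E.9 implies

$F(\beta^{(\ell,k)}) \le F(\beta^{(\ell,k-1)}) - \dfrac{\phi_t}{2}\,\|\beta^{(\ell,k)}-\beta^{(\ell,k-1)}\|_2^2,$

which implies that

$F(\beta^{(\ell,k)}) - F(\beta^*) \le F(\beta^{(\ell,k-1)}) - F(\beta^*) - \dfrac{\phi_t}{2}\,\|\beta^{(\ell,k)}-\beta^{(\ell,k-1)}\|_2^2 \le C\lambda^2 s.$

Therefore the induction hypothesis holds at the $k$th iteration:

$\|(\beta^{(\ell,k)})_{\mathcal{E}_\ell^c}\|_0 \le \tilde s$, $\phi_t \le \gamma_u\rho_+(2s+2\tilde s)$, and $F(\beta^{(\ell,k)}) - F(\beta^*) \le C\lambda^2 s.$

Using Lemma E.12, for $C'$ defined as before, we obtain

$\|\beta^{(\ell,k)}-\beta^*\|_2 \le C'\lambda\sqrt{s}.$

This completes the induction on $k$. For $\ell$, the proof is similar. □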


APPENDIX F: PRELIMINARY LEMMAS 

In this section, we collect several preliminary lemmas. 

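Lemma F.1 ($\ell_1$ Cone Property for Approximate Solution). Let $\mathcal{E}$ be such that $S \subseteq \mathcal{E}$. If $\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon < \|\lambda_{\mathcal{E}^c}\|_{\min}$, we must have

$\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le \dfrac{\|\lambda\|_\infty + \|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon}{\|\lambda_{\mathcal{E}^c}\|_{\min} - (\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon)}\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1.$

Proof of Lemma F.1. For any $\xi \in \partial\|\beta\|_1$, let $u = \nabla\mathcal{L}(\beta) + \lambda\odot\xi$. By the mean value theorem, there exists a $\gamma \in [0,1]$ such that $\nabla\mathcal{L}(\beta) - \nabla\mathcal{L}(\beta^*) = [\nabla^2\mathcal{L}(\gamma\beta^* + (1-\gamma)\beta)](\beta-\beta^*)$. Write $H = \nabla^2\mathcal{L}(\gamma\beta^* + (1-\gamma)\beta)$. Then we have

(F.1)  $\langle\nabla\mathcal{L}(\beta) + \lambda\odot\xi,\, \beta-\beta^*\rangle = \langle\nabla\mathcal{L}(\beta^*) + H(\beta-\beta^*) + \lambda\odot\xi,\, \beta-\beta^*\rangle \le \|u\|_\infty\,\|\beta-\beta^*\|_1.$

Using the fact $(\beta-\beta^*)^T H(\beta-\beta^*) \ge 0$, we have

$0 \le \|u\|_\infty\,\|\beta-\beta^*\|_1 - \underbrace{\langle\nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle}_{\mathrm{I}} - \underbrace{\langle\lambda\odot\xi,\, \beta-\beta^*\rangle}_{\mathrm{II}}.$

Using a similar argument as in the proof of Lemma B.7, we have $\mathrm{I} \ge -\|\nabla\mathcal{L}(\beta^*)\|_\infty\,\|\beta-\beta^*\|_1$, and

$\mathrm{II} = \langle\lambda\odot\xi,\, \beta-\beta^*\rangle = \langle(\lambda\odot\xi)_{\mathcal{E}^c},\, (\beta-\beta^*)_{\mathcal{E}^c}\rangle + \langle(\lambda\odot\xi)_{\mathcal{E}},\, (\beta-\beta^*)_{\mathcal{E}}\rangle$

$\qquad \ge \|\lambda_{\mathcal{E}^c}\|_{\min}\,\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 - \|\lambda_{\mathcal{E}}\|_\infty\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1.$

Plugging the above bounds into (F.1) and taking the infimum with respect to $\xi \in \partial\|\beta\|_1$ yields

$0 \le -\big(\|\lambda_{\mathcal{E}^c}\|_{\min} - (\|\nabla\mathcal{L}(\beta^*)\|_\infty + \omega_\lambda(\beta))\big)\,\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 + \big(\|\lambda\|_\infty + \|\nabla\mathcal{L}(\beta^*)\|_\infty + \omega_\lambda(\beta)\big)\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1,$

or equivalently

$\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le \dfrac{\|\lambda\|_\infty + \|\nabla\mathcal{L}(\beta^*)\|_\infty + \omega_\lambda(\beta)}{\|\lambda_{\mathcal{E}^c}\|_{\min} - (\|\nabla\mathcal{L}(\beta^*)\|_\infty + \omega_\lambda(\beta))}\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1.$

Using the stopping criterion, i.e. $\omega_\lambda(\beta) \le \epsilon$, we have that

$\|(\beta-\beta^*)_{\mathcal{E}^c}\|_1 \le \dfrac{\|\lambda\|_\infty + \|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon}{\|\lambda_{\mathcal{E}^c}\|_{\min} - (\|\nabla\mathcal{L}(\beta^*)\|_\infty + \epsilon)}\,\|(\beta-\beta^*)_{\mathcal{E}}\|_1.$

Therefore we have proved the desired result. □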


Lemma F.2. Let $D_{\mathcal{L}}(\beta_1,\beta_2) = \mathcal{L}(\beta_1) - \mathcal{L}(\beta_2) - \langle\nabla\mathcal{L}(\beta_2),\, \beta_1-\beta_2\rangle$ and $D_{\mathcal{L}}^s(\beta_1,\beta_2) = D_{\mathcal{L}}(\beta_1,\beta_2) + D_{\mathcal{L}}(\beta_2,\beta_1)$. For $\beta(t) = \beta^* + t(\beta-\beta^*)$ with $t \in (0,1]$, we have that

$D_{\mathcal{L}}^s(\beta(t),\beta^*) \le t\,D_{\mathcal{L}}^s(\beta,\beta^*).$

Proof of Lemma F.2. Let $Q(t) = D_{\mathcal{L}}(\beta(t),\beta^*) = \mathcal{L}(\beta(t)) - \mathcal{L}(\beta^*) - \langle\nabla\mathcal{L}(\beta^*),\, \beta(t)-\beta^*\rangle$. Since the derivative of $\mathcal{L}(\beta(t))$ with respect to $t$ is $\langle\nabla\mathcal{L}(\beta(t)),\, \beta-\beta^*\rangle$, it follows that

$Q'(t) = \langle\nabla\mathcal{L}(\beta(t)) - \nabla\mathcal{L}(\beta^*),\, \beta-\beta^*\rangle.$

Therefore, the symmetric Bregman divergence $D_{\mathcal{L}}^s(\beta(t),\beta^*)$ can be written as

$D_{\mathcal{L}}^s(\beta(t),\beta^*) = \langle\nabla\mathcal{L}(\beta(t)) - \nabla\mathcal{L}(\beta^*),\, t(\beta-\beta^*)\rangle = t\,Q'(t)$  for $0 < t \le 1$.

Plugging $t = 1$ into the equation above, we have $Q'(1) = D_{\mathcal{L}}^s(\beta,\beta^*)$ as a special case. If $Q(t)$ is convex, then $Q'(t)$ is non-decreasing and thus

$D_{\mathcal{L}}^s(\beta(t),\beta^*) = t\,Q'(t) \le t\,Q'(1) = t\,D_{\mathcal{L}}^s(\beta,\beta^*).$

It remains to show the convexity of $Q(t)$ with respect to $t$; or equivalently, the convexity of $\mathcal{L}(\beta(t))$ and $\langle\nabla\mathcal{L}(\beta^*),\, \beta^*-\beta(t)\rangle$, respectively. First, $\beta(t)$ is linear in $t$, that is, $\beta(\alpha_1 t_1 + \alpha_2 t_2) = \alpha_1\beta(t_1) + \alpha_2\beta(t_2)$ for $t_1, t_2 \in [0,1]$ and $\alpha_1, \alpha_2 \ge 0$ such that $\alpha_1 + \alpha_2 = 1$. Then the convexity of $\mathcal{L}(\beta(t))$ follows from this linearity property of $\beta(t)$ and the convexity of the Huber loss. For the second term, the convexity directly follows from the bi-linearity of the inner product function. This finishes the proof.

□
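The scaling property of Lemma F.2 can be checked numerically for the Huber loss mentioned in the proof. The sketch below makes several assumptions that are not in the original: the loss is taken as the average Huber loss of the residuals $y_i - x_i^T\beta$ with cutoff `c`, the data are random, and the function names are hypothetical. It evaluates $t\,D_{\mathcal{L}}^s(\beta,\beta^*) - D_{\mathcal{L}}^s(\beta(t),\beta^*)$ on a grid of $t \in (0,1]$ and returns the smallest value found.

```python
import random


def huber_psi(r, c):
    """Derivative of the Huber function at residual r with cutoff c."""
    return max(-c, min(c, r))


def grad_huber(X, y, b, c=1.0):
    """Gradient of L(b) = (1/n) * sum_i huber(y_i - x_i^T b)."""
    n, p = len(y), len(b)
    psi = [huber_psi(yi - sum(xj * bj for xj, bj in zip(row, b)), c)
           for row, yi in zip(X, y)]
    return [-sum(X[i][j] * psi[i] for i in range(n)) / n for j in range(p)]


def sym_bregman(X, y, b1, b2):
    """Symmetric Bregman divergence <grad L(b1) - grad L(b2), b1 - b2>."""
    g1, g2 = grad_huber(X, y, b1), grad_huber(X, y, b2)
    return sum((u - v) * (a - b) for u, v, a, b in zip(g1, g2, b1, b2))


def check_scaling(seed=0, n=30, p=5):
    """Smallest value of t * D^s(b, b*) - D^s(b(t), b*) over a grid of t."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 3) for _ in range(n)]
    b_star = [rng.gauss(0, 1) for _ in range(p)]
    b = [rng.gauss(0, 2) for _ in range(p)]
    full = sym_bregman(X, y, b, b_star)
    worst = float("inf")
    for k in range(1, 21):
        t = k / 20.0
        b_t = [s + t * (v - s) for s, v in zip(b_star, b)]
        worst = min(worst, t * full - sym_bregman(X, y, b_t, b_star))
    return worst
```

A nonnegative result (up to rounding) is what the lemma predicts for any convex differentiable loss, since the argument only uses that $Q'(t)$ is non-decreasing.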

The following lemma is taken from [2] and describes a general concentration inequality for quadratic forms in sub-Gaussian random variables.

Lemma F.3 (Hanson-Wright Inequality, [2]). Let $v = (v_1,\ldots,v_n)^T \in \mathbb{R}^n$ be a random vector with independent components $v_i$ such that $v_i \sim \text{sub-Gaussian}(0,\nu^2)$. Let $A$ be an $n \times n$ matrix. Then, for every $t \ge 0$,

$P\big(|v^TAv - \mathbb{E}\,v^TAv| > t\big) \le 2\exp\Big(-C_h\min\Big\{\dfrac{t^2}{\nu^4\|A\|_F^2},\; \dfrac{t}{\nu^2\|A\|_2}\Big\}\Big),$

where $C_h$ is a universal constant, not depending on $A$, $v$ and $n$.
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